An Empirical Study on Finding Spans

We present an empirical study of methods for span finding, the selection of consecutive token sequences in text for downstream tasks. We focus on approaches that can be employed in training end-to-end information extraction systems, and find that there is no definitive solution without considering task properties. We offer the following observations to help with future design choices: 1) a tagging approach often yields higher precision, while span enumeration and boundary prediction provide higher recall; 2) span type information can benefit a boundary prediction approach; 3) additional contextualization does not help span finding in most cases.


Introduction
Various information extraction (IE) tasks require a span finding component, which either directly yields the output or serves as an essential step before downstream linking. In named entity recognition (NER), spans in text are detected and typed; coreference resolution requires mention spans; mention spans are linked when performing relation extraction (RE), and event extraction additionally requires detection of trigger spans. In extractive question answering (QA), a span in a passage is detected and presented as the answer to a given question.
Following the proliferation of large pre-trained models (Peters et al., 2018; Devlin et al., 2019; Raffel et al., 2020, i.a.), recent approaches to span finding can be roughly divided into three types: tagging, span enumeration, or boundary prediction (see §2, Table 1). In this paper, we present an empirical study of these methods and their influence on downstream tasks, hoping to shed light on how to build future NLP systems. Specifically, we discuss the design choice of span finding methods and examine two common tricks for improving performance. We answer the following questions:
1. Q: What is the best span finding method? Does this hold for various NLP tasks or different pretrained encoders?
* Equal contribution.
A: The choice depends on the downstream task: tagging generally has higher precision, but span enumeration and boundary prediction have higher recall. In most cases, boundary prediction is preferable to span enumeration. Tagging performs much better with an encoder pretrained as a masked language model (e.g., RoBERTa) than with the encoder of an encoder-decoder pretrained model (e.g., T5).
2. Q: Does inclusion of mention type information help (e.g., just B-PERSON or B in tagging)?
A: For downstream IE tasks, tagging and span enumeration approaches prefer untyped tags, but boundary prediction heavily relies on type information to obtain good performance.
3. Q: Is additional contextualization with RNN layers on top of Transformers helpful? Some prior work puts an LSTM layer on top of embeddings produced by Transformers (Straková et al., 2019; Shibuya and Hovy, 2020; Wang et al., 2021, i.a.). Is this necessary?
A: Not for RoBERTa (a pretrained encoder-only model), but we observe a slight benefit from BiLSTM layers atop T5 (an encoder-decoder model).

Background
We assume a pre-trained base model is in place (e.g., BERT (Devlin et al., 2019), T5 (Raffel et al., 2020)). The encoding for each token i is denoted as x_i ∈ R^d. For further analyses, we focus on RoBERTa (Liu et al., 2019) and T5, since these represent two classes of pretrained models: one with an encoder trained with a reconstruction loss; the other with both an encoder and a decoder.
Tagging Span selection can be reduced to a sequence tagging problem, usually under the BIO scheme. Such tagging problems can be modeled using a linear-chain conditional random field (CRF; Lafferty et al., 2001), with tags either typed or untyped (for example, NER tags may be labeled with entity types, B-PERSON, I-LOCATION, or without, B, I). While features input to the CRF may be hand-crafted, in recent work they are predominantly outputs of a neural network. Note that reducing span finding to a tagging problem does not allow the system to produce overlapping spans. There exist methods to address this (e.g., for the task of nested NER), but we leave the analysis of these methods for future work.
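For concreteness, the BIO encoding and decoding described above can be sketched in pure Python; the helper names (`spans_to_bio`, `bio_to_spans`) are ours, and the sketch assumes non-overlapping spans with inclusive end indices:

```python
# A minimal sketch of BIO span <-> tag conversion (pure Python, no CRF);
# the function names and conventions here are illustrative, not from the paper.

def spans_to_bio(n_tokens, spans, typed=True):
    """Encode non-overlapping spans [(start, end, label), ...] (end inclusive)
    as a BIO tag sequence; untyped tags drop the label suffix."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        suffix = f"-{label}" if typed else ""
        tags[start] = "B" + suffix
        for i in range(start + 1, end + 1):
            tags[i] = "I" + suffix
    return tags

def bio_to_spans(tags):
    """Decode a BIO sequence back to (start, end, label) triples."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        if tag.startswith("B") or tag == "O":
            if start is not None:
                spans.append((start, i - 1, label))
                start = None
            if tag.startswith("B"):
                start, label = i, (tag[2:] if "-" in tag else None)
    return spans
```

Note that decoding a predicted tag sequence can never yield two overlapping spans, which is exactly the limitation discussed above.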
Span Enumeration Lee et al. (2017) first proposed to enumerate all spans (up to length ℓ) and predict whether they are entity mentions, for coreference resolution. A span embedding s_{i:j} is derived for each span [i:j], usually a concatenation of the left and right boundary tokens x_i and x_j, a pooled version of all the tokens between i and j (usually an attention-weighted sum with a learned global query vector q (Lee et al., 2017; Lin and Ji, 2019)), and optionally some additional manual features φ(i, j):

s_{i:j} = [x_i; x_j; pool(x_{i:j}); φ(i, j)].

Such a span embedding can be used for both span detection (untyped) and span typing. For detection, a span is given a score indicating whether it is a span of interest by applying a feedforward network f on top of the span embedding:

score(i, j) = f(s_{i:j}).

For typing, one can create a classifier with the span embedding as input, and the set of types plus a null type (not a selected span) as the output label set.
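The enumeration and pooling steps can be sketched with NumPy; this is a minimal illustration under simplifying assumptions (the query vector `q` stands in for a trained parameter, and the optional manual features are omitted), not the authors' implementation:

```python
# A sketch of the span representation in the style of Lee et al. (2017);
# q is a placeholder for a learned global query, manual features omitted.
import numpy as np

def span_embedding(X, i, j, q):
    """X: (n, d) token encodings; span [i:j] inclusive; q: (d,) query vector.
    Returns [x_i; x_j; attention-pooled x_{i:j}] of shape (3d,)."""
    scores = X[i:j + 1] @ q                      # attention logits over the span
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                         # softmax over span tokens
    pooled = alpha @ X[i:j + 1]                  # (d,) weighted sum
    return np.concatenate([X[i], X[j], pooled])

def enumerate_spans(n, max_len):
    """All candidate spans up to length max_len, as inclusive (start, end) pairs."""
    return [(i, j) for i in range(n) for j in range(i, min(i + max_len, n))]
```

A detection scorer would then be a small feedforward network applied to each of these O(n · ℓ) embeddings, which is what makes span enumeration comparatively expensive but high-recall.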
Boundary Prediction BiDAF (Seo et al., 2017) introduced a method to select one span from a sequence of text that we term boundary prediction. It has been widely adopted following the proliferation of work based on pretrained models such as BERT for QA.
In boundary prediction, two vectors l, r over tokens are computed to indicate whether a token is a left or right boundary of a span. To determine the most likely span, one selects

(i*, j*) = argmax_{i ≤ j} (l_i + r_j).

This method can be extended to the case where the model does not select any span (Devlin et al., 2019): a special [CLS] token may be prepended to the sequence of tokens, taking index 0, and the model is trained to select the placeholder span [0, 0] if no span should be selected.
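The single-span selection with the [CLS] no-span placeholder can be sketched as follows, assuming `l` and `r` are already-computed boundary logits (a brute-force argmax for clarity; real implementations vectorize this):

```python
# A minimal sketch of single-span boundary prediction with a [CLS]
# no-answer placeholder at index 0; l and r are assumed boundary logits.
import numpy as np

def select_span(l, r):
    """l, r: (n,) left/right boundary logits, where index 0 is [CLS].
    Returns the (i, j) with i <= j maximizing l[i] + r[j];
    (0, 0) means no span is selected."""
    n = len(l)
    best, best_score = (0, 0), l[0] + r[0]   # the "no span" placeholder
    for i in range(1, n):
        for j in range(i, n):
            if l[i] + r[j] > best_score:
                best, best_score = (i, j), l[i] + r[j]
    return best
```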
Boundary prediction can also be extended to select more than one span. In CASREL (Wei et al., 2020), instead of selecting the single most likely left and right indices, multiple left and right indices are selected if their scores surpass a threshold τ:

{i : l_i > τ} and {j : r_j > τ}.

A heuristic is then used to match these candidate left/right boundaries into spans. Li et al. (2020) extended the idea: instead of a heuristic, a model g (which can be a 2-layer feedforward network, as is used in our experiments) is used to score all candidate spans selected by the threshold (up to length ℓ):

score(i, j) = g([x_i; x_j]).

Since it is the most flexible and heuristic-free, we focus on this last method of Li et al. (2020). To turn this span detector into a typed classifier, one can apply the same trick as in span enumeration.
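The thresholded multi-span variant can be sketched as follows; the `scorer` argument is a stand-in for the 2-layer feedforward network, and the threshold `tau` and `max_len` values are illustrative, not taken from the experiments:

```python
# A sketch of threshold-based multi-span selection in the style of
# Li et al. (2020): candidate boundaries pass a threshold, then a
# scorer (standing in for a 2-layer feedforward net) filters pairs.
import numpy as np

def select_spans(l, r, X, scorer, tau=0.5, max_len=8):
    """l, r: (n,) boundary probabilities; X: (n, d) token encodings;
    scorer maps a concatenated [x_i; x_j] to a match probability."""
    lefts = [i for i in range(len(l)) if l[i] > tau]
    rights = [j for j in range(len(r)) if r[j] > tau]
    spans = []
    for i in lefts:
        for j in rights:
            if i <= j < i + max_len and scorer(np.concatenate([X[i], X[j]])) > tau:
                spans.append((i, j))
    return spans
```

Because every surviving boundary pair is scored independently, this variant can emit zero, one, or many spans, which is what makes it the most flexible of the three extensions.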

Experimental Setup
We perform all our experiments on RoBERTa_base and the encoder part of T5_base (T5_base^enc). Each reported number is an average of 3 runs with different random seeds. Each run is trained on a single Quadro RTX 6000 GPU with 24GB memory.

NER, EE, and RE
We use the ACE 2005 dataset (Walker et al., 2006) to evaluate model performance on NER, EE, and RE tasks. We follow OneIE (Lin et al., 2020) to compile dataset splits, and establish the baseline using their released codebase.
For comparable setups across tasks, we disable all the global features, which involve complicated cross-subtask and cross-instance interactions that are hard to adapt to other span finding methods. We also disable the additional biaffine entity classifier and event type classifier, and use the typing from the span finding module directly at inference time for the experiments in Table 2.
For NER, in addition to the standard entity classification F1 (Ent-C), we also report entity identification F1 (Ent-I) to measure how well models detect spans. For EE, we use the standard {trigger (Trig) / argument (Arg)}-{identification (I) / classification (C)} F1 scores. For RE, the standard F1 is used.
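The distinction between identification and classification F1 can be illustrated with a small helper; this is our own sketch of the standard span-level metric, not the official scorer used in the experiments:

```python
# A sketch of span-level identification vs. classification F1;
# the metric mirrors Ent-I / Ent-C, but the implementation is illustrative.

def span_f1(gold, pred, typed):
    """gold, pred: sets of (start, end, label) triples. When typed=False,
    labels are ignored (identification); otherwise they must match."""
    strip = (lambda s: set(s)) if typed else (lambda s: {(i, j) for i, j, _ in s})
    g, p = strip(gold), strip(pred)
    tp = len(g & p)
    prec = tp / len(p) if p else 0.0
    rec = tp / len(g) if g else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```

For example, a system that finds every gold span but mislabels half of them scores perfectly on identification while losing on classification.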
Coreference Resolution We use the higher-order coreference resolution model of Lee et al. (2018), as implemented in AllenNLP (Gardner et al., 2018), as the baseline for coreference resolution.
Extractive QA We use the Transformer QA model following BERT (Devlin et al., 2019) as the baseline. We evaluate on the dev set of SQuAD 2.0 (Rajpurkar et al., 2018), a large-scale reading comprehension dataset containing both answerable and unanswerable questions. For questions with multiple answer spans, we keep the first span in the input sequence and discard the others. We use exact match (EM) and token overlap F1 for QA.

Which Span Finder to Use?
From the results presented in Table 2, we find that boundary prediction is generally preferable to span enumeration. Although span enumeration outperforms boundary prediction by a small margin in coreference resolution, boundary prediction outperforms span enumeration on the other downstream tasks. While tagging and span enumeration suffer a considerable performance drop on all downstream tasks when using T5_base^enc as the encoder, boundary prediction suffers only a slight drop.
From the breakdown of the mention scores of each model on the OntoNotes dataset (coreference resolution) in Table 3, we can see that although span enumeration leans heavily toward recall,¹ boundary prediction can reach a comparable level of recall; and in a downstream task like QA where precision is needed, span enumeration cannot reach a level of performance comparable to boundary prediction.
The choice between tagging and boundary prediction depends on various factors, including but not limited to the language model, the downstream task, and the training strategy. Overall, tagging excels at precision; in contrast, boundary prediction and span enumeration have better recall.

¹ In the evaluation of coreference resolution, singleton mention clusters (i.e., clusters that have only 1 mention) are ignored when computing the evaluation scores. This practice weeds out many spans that should not be selected as mentions, which is why a coreference model can achieve state-of-the-art results with low mention detection precision.

Does Typing Help Span Finders?
As can be seen from the results in Table 4, tagging with untyped labels outperforms tagging with typed labels in all tasks except trigger classification; the margin is even more significant for event argument extraction and relation extraction. We hypothesize that under joint training, the types in the labels might hinder the model from learning other objectives. Therefore, if typed tags are not necessary, we recommend using plain BIO labels for tagging.
As for boundary prediction, we found that performing classification with types is crucial in joint training with downstream tasks. This is possibly due to the two-step nature of the method, where the first step can be seen as a coarse classifier that selects potential mention candidates while the second step double-checks those candidates (or classifies them into more fine-grained types). Span enumeration, however, is not much affected by the inclusion of types. We hypothesize this results from label imbalance: under span enumeration, a considerable number of spans should not be labeled as valid mention spans. When downstream tasks further require classifying spans into more fine-grained types, the label distribution becomes seriously imbalanced and dominated by the null-type label, making learning ineffective.

Additional Contextualization?
We further examine the commonly seen practice of adding additional contextualization with RNN layers atop Transformer encoders. Following Lee et al. (2017) as implemented in AllenNLP (Gardner et al., 2018), we use a 1-layer BiLSTM with a hidden dimension of 200 per direction.
We stack the BiLSTM contextual layer atop the Transformer encoders and report the experimental results in Table 5. We observe that, for IE tasks, adding additional contextualization does not affect model performance; for extractive QA, it improves model performance. Also, when using encoder-decoder models (e.g., T5), the additional contextualization leads to higher variance on downstream tasks than when using encoder-only models (e.g., BERT). We hypothesize that the difference comes from the architecture of T5, in which the encoder learns to specialize its representation to support the decoder. Therefore, when the encoder is used alone, an additional contextualization layer might serve a purpose similar to the decoder's and must, to some extent, learn how to utilize that representation. Even with such exceptions, from a design choice perspective the merit of this trick is limited: it is not helpful in most cases while introducing training variance and additional parameters.

Conclusions
We identified and investigated three common span finding methods in the NLP community: tagging, span enumeration, and boundary prediction. Through extensive experiments, we found that no single recipe is best for all scenarios. Design choices depend on specific task properties, in particular the tradeoff between precision and recall. We suggest that precision-focused tasks consider tagging or boundary prediction, and that recall-focused tasks (such as coreference resolution) consider span enumeration or boundary prediction.
We further examined two commonly used tricks to improve span finding performance: adding span type information during training, and adding additional contextualization with an RNN on top. We observed that boundary prediction on IE tasks heavily relies on type information, and that additional contextualization mostly does not help span finding.
Architectures will continue to evolve and models will continue to grow in size, which may lead to different conclusions about the relative benefits of these approaches. Still, the fundamental task of isolating informative spans of text will remain. We hope this study informs system designers working with existing models today, and serves as a starting point for further inquiry in the future.

Limitations
As an empirical study, we provide observations under different combinations of design choices for building an end-to-end trained IE system with a span finding component. We hope these observations provide insights for future work, but we are bound by the limitations of empirical studies, and a theoretical analysis is out of the scope of this paper. As a result, we make our claims based only on experimental results with the baseline models we evaluated, and hope they generalize well to other architectures.

Figure 1: Three common methods for span finding, detecting the mention spans North Korea and nuclear weapons in the sentence. " " denotes a candidate span that is too long to be considered.

Table 2: Basic experimental results on downstream tasks that involve mention detection. We report results for entity extraction and event extraction on the ACE05-E+ dataset (F-score, %), relation extraction on the ACE05-R dataset (F-score, %), coreference resolution on the OntoNotes dataset, and QA on SQuAD 2.0.

Table 3: Breakdown of mention scores for each span finding method on the OntoNotes dataset. We present results from models using the RoBERTa_base encoder.

Table 4: Experimental results of typing on IE tasks. Positive impact on model performance is shown in green, negative in red.

Table 5: Experimental results of adding a BiLSTM contextual layer to our baseline models. The table shows performance gaps compared to counterparts in Table 2 that do not have additional contextualization. Positive impact on model performance is shown in green, negative in red.