Sentence Identification with BOS and EOS Label Combinations

The sentence is a fundamental unit in many NLP applications. Sentence segmentation is widely used as the first preprocessing task, where an input text is split into consecutive sentences considering the end of the sentence (EOS) as their boundaries. This task formulation relies on a strong assumption that the input text consists only of sentences, or what we call the sentential units (SUs). However, real-world texts often contain non-sentential units (NSUs) such as metadata, sentence fragments, nonlinguistic markers, etc. which are unreasonable or undesirable to be treated as a part of an SU. To tackle this issue, we formulate a novel task of sentence identification, where the goal is to identify SUs while excluding NSUs in a given text. To conduct sentence identification, we propose a simple yet effective method which combines the beginning of the sentence (BOS) and EOS labels to determine the most probable SUs and NSUs based on dynamic programming. To evaluate this task, we design an automatic, language-independent procedure to convert the Universal Dependencies corpora into sentence identification benchmarks. Finally, our experiments on the sentence identification task demonstrate that our proposed method generally outperforms sentence segmentation baselines which only utilize EOS labels.


Introduction
The sentence, which we refer to as the sentential unit (SU), is a fundamental unit of processing in many NLP applications including syntactic parsing (Dozat and Manning, 2017), semantic parsing (Dozat and Manning, 2018), and machine translation (Liu et al., 2020). Existing works mostly rely on sentence segmentation (a.k.a. sentence boundary detection) as the first preprocessing task, where we predict the end of the sentence (EOS) to split a text into consecutive SUs (Kiss and Strunk, 2006; Gillick, 2009). This approach relies on a strong assumption that the text consists only of SUs; however, real-world texts such as web content often contain non-sentential units (NSUs), e.g. the metadata of attachments embedded in an email body, repeated symbols used to separate texts, and irregular series of nouns. Such NSUs may cause detrimental or unexpected results in downstream tasks if they are treated as parts of SUs, so it is desirable to distinguish them from SUs in the first preprocessing step.
To tackle this problem, we formulate a novel task of sentence identification, where the goal is to identify SUs while excluding NSUs in a given text (§3). This can be regarded as an SU span extraction task, where each SU span is represented by the beginning of the sentence (BOS) and the EOS labels. We illustrate the difference between sentence segmentation and sentence identification in Table 1. In sentence segmentation, the text fragment of an embedded file ("-TEXT.htm<< File: TEXT.htm >>") needs to be considered as a part of an SU. In contrast, sentence identification can regard it as an NSU and exclude it for downstream applications such as dependency parsing.
To conduct sentence identification, we propose a simple method which effectively combines the BOS and EOS probabilities to determine both SUs and NSUs (§4). To be specific, we first train the BOS and EOS labeling models based on either the sentence identification dataset (with SUs and NSUs) or sentence segmentation dataset (only SUs). Then, we search for the most probable spans of SUs and NSUs using a simple dynamic programming framework. Theoretically, our method can be considered as a natural generalization of existing sentence segmentation algorithms.
To evaluate this task, we design an automatic procedure to convert the Universal Dependencies (UD) corpora (de Marneffe et al., 2021) into sentence identification benchmarks (§5). To be specific, (i) we use the original sentence boundaries in UD as the unit (SU and NSU) boundaries and (ii) classify each unit as an SU iff it contains at least one clausal predicate with a core/non-core argument. Importantly, our classification rule follows the definition of lexical sentence in linguistics (Nunberg, 1990), is easily customizable with language-independent rules, and makes reasonable classification within the scope of our experiments.
Table 1: Illustration of sentence segmentation and sentence identification. In sentence segmentation, EOS labels (E) are used to segment the input text into consecutive SUs (in blue). In sentence identification, only the spans bracketed by the BOS (B) and EOS labels are extracted as SUs, while the rest can be excluded as NSUs.
To conduct our experiments, we focus on the English Web Treebank (Silveira et al., 2014) as the primary benchmark for sentence identification and train the BOS/EOS labeling models by finetuning RoBERTa (Liu et al., 2019) (§6). We also propose techniques to develop these models using a standard sentence segmentation dataset, i.e. the Wall Street Journal corpus (Marcus et al., 1993), which only contains clean, edited SUs without any NSUs.
Based on our experimental results, we demonstrate that our proposed method generally outperforms sentence segmentation baselines which only utilize EOS labels (§7). These results highlight the importance of combining the BOS labels in addition to the EOS labels for accurate sentence identification under various conditions.

Background
Sentence segmentation, a.k.a. sentence boundary detection, is the task of segmenting an input text into the unit of sentences. Despite the long history of study (Riley, 1989) and its importance in the entire NLP pipeline (Walker et al., 2001), this area has received relatively little attention. One reason is that the task has been recognized as "long solved" (Read et al., 2012), with the most recent approach reporting 99.8% F1 score on the standard English Wall Street Journal (WSJ) dataset (Wicks and Post, 2021). Their state-of-the-art method ERSATZ combines (i) a regular-expression based detector of candidate sentence boundaries, followed by (ii) a Transformer-based (Vaswani et al., 2017) binary classifier which predicts whether the candidate boundary is EOS based on the local context, i.e. the few surrounding words. This modern context-based approach has been shown to outperform competitive, widely used baselines such as SPLITTA (Gillick, 2009), PUNKT (Kiss and Strunk, 2006), and MOSES (Koehn et al., 2007).
However, two important aspects are not fully addressed in the current literature. The first is the coverage of diverse domains, genres, and writing styles. Existing works (including Wicks and Post, 2021) focus on formal/edited text and assume the existence of sentence-ending punctuation (e.g. full stops) at the sentence boundaries. However, social media texts often lack such punctuation and contain various types of non-linguistic noise, which can lead to a substantial degradation in segmentation performance (Read et al., 2012; Rudrapal et al., 2015). Speech transcription texts also usually contain disfluent, ungrammatical, or fragmented structures and lack both punctuation and casing (Wang et al., 2019; Rehbein et al., 2020). Considering the amount of such informal or non-standard texts in the real world, it is compelling to expand the capability of sentence segmentation beyond formal, standardized text.
The second aspect is the coverage of multiple languages. Different languages involve different complexities in sentence segmentation, e.g. Chinese requires the disambiguation of commas as the sentence-ending punctuation (Xue and Yang, 2011) and Thai does not mark EOS with any type of punctuation (Aroonmanakun et al., 2007; Zhou et al., 2016). To advance NLP from a multilingual perspective, it is crucial to develop and evaluate models in multiple languages: Wicks and Post (2021) make an important step in this direction, proposing a language-agnostic, unified sentence segmentation model covering a total of 87 languages.
Based on these observations, we first propose to extend the task of sentence segmentation to sentence identification, which expands the capability of sentence segmentation beyond formal, standardized text (§3, §4). Secondly, we propose a cross-lingual method of benchmarking sentence identification based on the UD corpora, considering every word or character as the candidate boundary to cover diverse domains, genres, and languages that lack sentence-ending punctuation (§5). Finally, we follow Wicks and Post (2021) to develop modern neural-based models that require no language-specific engineering and can be developed for different languages in a unified manner (§6).

Sentence Segmentation Task
First, we introduce a precise (re-)formulation of the sentence segmentation task. Let $W = (w_0, w_1, \ldots, w_{N-1})$ represent the input text, where each $w_i$ denotes a word (but can also be a subword or character). We also define the text span $W[i:j] = (w_i, \ldots, w_{j-1})$, their concatenation $W[i:j] \oplus W[j:k] = W[i:k]$, and the sentence boundaries $B = (b_0, b_1, \ldots, b_M)$, where $b_0 = 0$, $b_M = N$, and $b_{i-1} < b_i$.
Next, we introduce the SU probability $p_{\mathrm{SU}}(W[i:j])$, which corresponds to the probability of the text span $W[i:j]$ being an SU. Based on this probability, the task of sentence segmentation can be formalized as searching for the boundaries $B$ which maximize the following probability (footnote 2):
$$\arg\max_{B} \prod_{i=1}^{M} p_{\mathrm{SU}}(W[b_{i-1}:b_i]) \tag{1}$$
The most standard approach is to define $p_{\mathrm{SU}}(W[i:j])$ based on a pretrained EOS labeling model, as we describe in §4.1. However, our (re-)formulation as Eq. (1) is more general and permits other definitions of SU probability as well.
Footnote 2: $M$ is a variable and need not be fixed during the search.

Sentence Identification Task
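In sentence identification, we consider that the input text $W$ can be segmented into consecutive, non-overlapping units of SUs and NSUs. Hence, we regard $B = (b_0, b_1, \ldots, b_M)$ as the unit (SU and NSU) boundaries and define the unit indicators $A = (a_1, a_2, \ldots, a_M)$ for each unit as follows:
$$a_i = \begin{cases} 1 & \text{if } W[b_{i-1}:b_i] \text{ is an SU} \\ 0 & \text{otherwise (i.e. an NSU)} \end{cases}$$
Writing $p_{\mathrm{NSU}}(W[i:j])$ for the probability of the text span $W[i:j]$ being an NSU, sentence identification can then be formalized as searching for the boundaries $B$ and indicators $A$ which maximize the following probability:
$$\arg\max_{B, A} \prod_{i=1}^{M} p_{\mathrm{SU}}(W[b_{i-1}:b_i])^{a_i} \, p_{\mathrm{NSU}}(W[b_{i-1}:b_i])^{1 - a_i} \tag{2}$$
Note that this strictly generalizes the sentence segmentation task in Eq. (1), which is a special case where $a_i = 1, \forall a_i \in A$. Based on this task formulation, we discuss how we can define $p_{\mathrm{SU}}(W[i:j])$ and $p_{\mathrm{NSU}}(W[i:j])$ to derive our sentence identification algorithm in §4.2.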

Sentence Segmentation Method
In the most standard approach, sentence segmentation employs an EOS labeling model $p_{\mathrm{EOS}}$ to define the SU probability $p_{\mathrm{SU}}$ in Eq. (1). To be specific, let $p_{\mathrm{EOS}}(w_i \mid W; \theta)$ denote the EOS labeling model, which computes the probability of $w_i$ being EOS in $W$ ($\theta$ denotes the model parameters). Typically, it is straightforward to train this model in a supervised learning setup using a dataset annotated with gold EOS boundaries (Wicks and Post, 2021). For brevity, we use the notation $p_{\mathrm{EOS}}(w_i)$ as a shorthand for $p_{\mathrm{EOS}}(w_i \mid W; \theta)$, i.e. we omit $W$ and $\theta$ (unless required) in the rest of this paper.
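Based on the pretrained model $p_{\mathrm{EOS}}$, we can define the SU probability as
$$p_{\mathrm{SU}}(W[i:j]) = p_{\mathrm{EOS}}(w_{j-1}) \prod_{i \le k < j-1} \bigl(1 - p_{\mathrm{EOS}}(w_k)\bigr),$$
which requires the last word $w_{j-1}$ to be EOS and all other words to be non-EOS. By substituting this definition, we can decompose Eq. (1) as follows:
$$(1) = \arg\max_{B} \prod_{i \in B_{\mathrm{EOS}}} p_{\mathrm{EOS}}(w_i) \prod_{i \notin B_{\mathrm{EOS}}} \bigl(1 - p_{\mathrm{EOS}}(w_i)\bigr) \tag{3}$$
where $B_{\mathrm{EOS}} = \{ b_i - 1 \mid i \in (1, 2, \ldots, M) \}$ represents all the EOS indices defined by $B$.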
This is a trivial optimization problem where we can simply choose $B_{\mathrm{EOS}} = \{ i \in (0, 1, \ldots, N-1) \mid p_{\mathrm{EOS}}(w_i) \ge 0.5 \}$ to maximize Eq. (3). This also shows that sentence segmentation can be conducted by predicting the EOS independently for each $w_i$ based on $p_{\mathrm{EOS}}(w_i)$. In contrast, sentence identification involves a more complex optimization problem which we solve using dynamic programming (§4.2).
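The independent thresholding rule described above can be illustrated with a minimal sketch. The function name `segment_by_eos` and the toy probabilities are hypothetical and only serve as an illustration; in practice the probabilities would come from a trained EOS labeling model.

```python
from typing import List


def segment_by_eos(words: List[str], p_eos: List[float],
                   threshold: float = 0.5) -> List[List[str]]:
    """Close a segment after every word whose EOS probability is >= threshold."""
    segments: List[List[str]] = []
    current: List[str] = []
    for word, prob in zip(words, p_eos):
        current.append(word)
        if prob >= threshold:
            segments.append(current)
            current = []
    if current:
        # Words after the last predicted EOS form a trailing segment.
        segments.append(current)
    return segments


words = ["Joe", "went", "home", ".", "He", "slept", "."]
p_eos = [0.01, 0.02, 0.10, 0.97, 0.03, 0.05, 0.99]  # toy values
print(segment_by_eos(words, p_eos))
# [['Joe', 'went', 'home', '.'], ['He', 'slept', '.']]
```

Each decision depends only on the probability of a single word, which is why no search over boundaries is needed in this setting.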

Sentence Identification Method
We extend the method of sentence segmentation (§4.1) to conduct sentence identification. To be specific, we employ pretrained BOS and EOS labeling models $p_{\mathrm{BOS}}$, $p_{\mathrm{EOS}}$ to define the SU and NSU probabilities $p_{\mathrm{SU}}$, $p_{\mathrm{NSU}}$ in Eq. (2). As a first step, we need to train the BOS and EOS labeling models: this can be conducted in a supervised manner using a dataset containing gold BOS and EOS labels, as we explain in §6.1.
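Based on the pretrained BOS and EOS labeling models, we can define the SU and NSU probabilities as follows:
$$p_{\mathrm{SU}}(W[i:j]) = p_{\mathrm{BOS}}(w_i) \, p_{\mathrm{EOS}}(w_{j-1}) \prod_{i < k < j} \bigl(1 - p_{\mathrm{BOS}}(w_k)\bigr) \prod_{i \le k < j-1} \bigl(1 - p_{\mathrm{EOS}}(w_k)\bigr)$$
$$p_{\mathrm{NSU}}(W[i:j]) = \prod_{i \le k < j} \bigl(1 - p_{\mathrm{BOS}}(w_k)\bigr) \bigl(1 - p_{\mathrm{EOS}}(w_k)\bigr)$$
In the SU probability $p_{\mathrm{SU}}$, the first word $w_i$ is required to be BOS, the last word $w_{j-1}$ to be EOS, and all other words to be neither BOS nor EOS. Note that this definition of $p_{\mathrm{SU}}$ is a natural generalization from §4.1, which only relies on the EOS probability $p_{\mathrm{EOS}}$.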
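In contrast, the NSU probability $p_{\mathrm{NSU}}$ requires all words to be neither BOS nor EOS. Notably, this definition does not distinguish contiguous NSUs in the sense that $p_{\mathrm{NSU}}(W[i:j]) \, p_{\mathrm{NSU}}(W[j:k]) = p_{\mathrm{NSU}}(W[i:k])$. This is convenient as we are only interested in the extraction of SUs and do not need to seek the exact boundaries between consecutive NSUs.
By substituting these definitions of $p_{\mathrm{SU}}$ and $p_{\mathrm{NSU}}$, we can decompose Eq. (2) as follows:
$$(2) = \arg\max_{B, A} \prod_{i \in B^{A}_{\mathrm{BOS}}} p_{\mathrm{BOS}}(w_i) \prod_{i \notin B^{A}_{\mathrm{BOS}}} \bigl(1 - p_{\mathrm{BOS}}(w_i)\bigr) \prod_{i \in B^{A}_{\mathrm{EOS}}} p_{\mathrm{EOS}}(w_i) \prod_{i \notin B^{A}_{\mathrm{EOS}}} \bigl(1 - p_{\mathrm{EOS}}(w_i)\bigr) \tag{4}$$
where $B^{A}_{\mathrm{BOS}} = \{ b_{i-1} \mid a_i = 1 \}$ and $B^{A}_{\mathrm{EOS}} = \{ b_i - 1 \mid a_i = 1 \}$ represent the BOS and EOS indices defined by $B$ and $A$. Therefore, our goal is to choose $B^{A}_{\mathrm{BOS}}$ and $B^{A}_{\mathrm{EOS}}$ which maximize Eq. (4). To this end, we need to consider the restrictions that (i) the first label should be BOS, (ii) the last label should be EOS, and (iii) BOS and EOS labels need to appear alternately. These restrictions can be incorporated in our dynamic programming framework to find the argmax of Eq. (4). For the precise algorithm, we refer the readers to Appendix A.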
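As an illustration of the span probabilities described in this section, the sketch below computes them in log space from per-word BOS and EOS probabilities. The function names and toy values are hypothetical, and the code assumes every probability lies strictly between 0 and 1 so that all logarithms are finite.

```python
import math
from typing import Sequence


def log_p_su(p_bos: Sequence[float], p_eos: Sequence[float], i: int, j: int) -> float:
    """Log-probability that W[i:j] is an SU: w_i is BOS, w_{j-1} is EOS,
    and no other BOS or EOS label occurs inside the span."""
    total = math.log(p_bos[i]) + math.log(p_eos[j - 1])
    total += sum(math.log(1.0 - p_bos[k]) for k in range(i + 1, j))
    total += sum(math.log(1.0 - p_eos[k]) for k in range(i, j - 1))
    return total


def log_p_nsu(p_bos: Sequence[float], p_eos: Sequence[float], i: int, j: int) -> float:
    """Log-probability that W[i:j] is an NSU: no word is BOS or EOS."""
    return sum(math.log(1.0 - p_bos[k]) + math.log(1.0 - p_eos[k])
               for k in range(i, j))


p_bos = [0.9, 0.1, 0.2, 0.1]  # toy values
p_eos = [0.1, 0.2, 0.8, 0.1]
print(log_p_su(p_bos, p_eos, 0, 3))   # span covering the first three words
print(log_p_nsu(p_bos, p_eos, 3, 4))  # last word treated as an NSU
```

Because the NSU score is a plain sum over words, the score of two adjacent NSU spans equals the score of their concatenation, which is the property that makes NSU-NSU boundaries irrelevant to the search.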

Evaluation
Due to the novelty of the task, currently there exists no benchmark for evaluating sentence identification. To address this issue, we propose a fully automatic procedure to convert the Universal Dependencies (UD) corpora (de Marneffe et al., 2021) into sentence identification benchmarks.
Concretely speaking, we conduct the following two steps based on the gold UD annotation: (i) the detection of unit (SU and NSU) boundaries and (ii) the classification of each unit into SU or NSU. As for (i), we simply use the original sentence boundaries in the UD annotation, where UD uses the term sentence in a broader sense including both SUs and NSUs (e.g. sentence fragments). Note that the exact boundaries between consecutive NSUs (which we call NSU-NSU boundaries) do not need to be accurate or consistent, since we are only interested in extracting the spans of SUs. However, we do expect that the original boundaries are generally reliable in all other cases (SU-SU and SU-NSU boundaries), which seems to be the case.
The main problem is (ii), i.e. how to classify each unit as an SU or NSU. To this end, we follow the notion of lexical sentence in linguistics (Nunberg, 1990), which defines an SU based on the dependencies among the lexical items, e.g. a group of words that contain a subject and predicate. In this work, we build upon the UD dependency relations and define an SU as a unit that contains at least one clausal predicate with a core or non-core argument. Here, a clause expresses an event or proposition, which we regard as an essential aspect of SUs. A clausal predicate and a core argument form the backbone of a clause, while a non-core argument modifies it (de Marneffe et al., 2021). Note that our current definition excludes noun phrases appearing by themselves, since they only consist of the nominal dependent relations. However, we can flexibly customize the definition of SUs to include or exclude such phrases.
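A rough sketch of such a classification rule is given below. The exact relation inventory used for the benchmark is not stated in this section, so the two sets are an assumption taken from the core-argument and non-core-dependent groups of the UD relation taxonomy, and `is_sentential_unit` is a hypothetical helper operating on the dependency relation labels of one unit.

```python
from typing import Iterable

# Assumption: relation sets follow the UD v2 taxonomy of core arguments and
# non-core dependents of clausal predicates; the paper's exact list may differ.
CORE_ARGUMENTS = {"nsubj", "obj", "iobj", "csubj", "ccomp", "xcomp"}
NON_CORE_DEPENDENTS = {"obl", "vocative", "expl", "dislocated", "advcl"}
CLAUSAL_EVIDENCE = CORE_ARGUMENTS | NON_CORE_DEPENDENTS


def is_sentential_unit(deprels: Iterable[str]) -> bool:
    """Return True if any token is attached to its head by a core or non-core
    argument relation, which implies that the head acts as a clausal predicate."""
    # Strip language-specific subtypes such as "nsubj:pass" -> "nsubj".
    return any(rel.split(":")[0] in CLAUSAL_EVIDENCE for rel in deprels)


# "Joe went to school ." -> contains a subject and an oblique argument.
print(is_sentential_unit(["nsubj", "root", "case", "obl", "punct"]))  # True
# "Mixed Tempura" -> only nominal modification, no clausal structure.
print(is_sentential_unit(["amod", "root"]))  # False
```

Customizing the definition of an SU, as mentioned above, then amounts to editing the relation sets.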
Due to the reliance on UD, our conversion procedure can be applied to a wide variety of languages supported in UD (currently over 100 languages). However, as a first set of experiments, we focus on the English Web Treebank (EWT) (Silveira et al., 2014) as the primary benchmark of sentence identification. This dataset comprises five genres of web media texts: namely weblogs, newsgroup threads, emails, product reviews, and Q&A websites. Consequently, the dataset contains formal SUs, informal SUs (e.g. without capitalization or punctuation) as well as a wide variety of NSUs.
We show some examples of NSUs in Table 2. Finally, we summarize the dataset statistics of EWT in Table 3. Overall, 17∼28% of the units were classified as NSUs, with the test set containing the highest proportion of NSUs. We also regard SU extraction as a word-level or character-level BIO labeling task and report the number of gold BIO labels in Table 3. At the word-level, we can see that the proportion of O-labels (indicating NSUs) is only 4∼8% and much smaller than the proportion of NSUs in terms of units: this is because NSUs are usually short and contain only a few words. At the character-level, the proportion of O-labels is slightly larger (6∼13%): this is because NSUs often contain extraordinarily long words like URLs and long sequences of non-linguistic symbols.
Overall, we could verify that there exists a non-negligible amount of NSUs in the EWT dataset, which we aim to exclude with sentence identification in our experiments.
Experimental Setup

Model Setup
As we discussed in §4.2, our sentence identification method requires pretrained BOS and EOS labeling models to identify SUs and NSUs. To develop these models, we simply finetune RoBERTa BASE by adding a binary BOS/EOS classifier on top of the encoder.
To enable our models to handle various lengths of the input texts, we concatenate $L$ consecutive units of gold SUs and NSUs as the input during training, where $L$ is sampled from a geometric distribution with parameter $p_{\mathrm{CC}}$ (footnote 5). However, the RoBERTa encoder has the restriction that the input text size cannot exceed 512 subwords. Therefore, if the input text size is too large, we replace $L$ with the maximum $L' < L$ which satisfies this restriction. Note that this is a common procedure to sample variable (instead of fixed) lengths of concatenated units (Joshi et al., 2020).
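The unit concatenation procedure can be sketched as follows. This is a minimal illustration under stated assumptions: `sample_num_units` and `num_units_to_concatenate` are hypothetical helpers, unit lengths are given in subwords, and the 512 limit is applied to the raw sum of unit lengths (special tokens are ignored).

```python
import random
from typing import List, Optional


def sample_num_units(p_cc: float, rng: random.Random) -> Optional[int]:
    """Sample L from a geometric distribution on {1, 2, ...} with parameter p_cc.
    None stands for L = infinity, the convention used when p_cc = 0."""
    if p_cc == 0:
        return None
    count = 1
    while rng.random() >= p_cc:  # continue with probability 1 - p_cc
        count += 1
    return count


def num_units_to_concatenate(unit_lengths: List[int], start: int, p_cc: float,
                             rng: random.Random, max_len: int = 512) -> int:
    """Number of consecutive units, beginning at `start`, that form one input.
    The sampled L is reduced until the concatenation fits into max_len subwords;
    0 is returned if even the first unit is too long."""
    target = sample_num_units(p_cc, rng)
    total, taken = 0, 0
    for length in unit_lengths[start:]:
        if target is not None and taken == target:
            break
        if total + length > max_len:
            break
        total += length
        taken += 1
    return taken


rng = random.Random(0)
# With p_cc = 0 the units are concatenated up to the maximal length.
print(num_units_to_concatenate([200, 200, 200], 0, 0.0, rng))  # 2
```

Smaller values of the parameter produce longer inputs on average, while a value of 1 always yields a single unit.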
Assuming the existence of the in-domain sentence identification dataset (EWT Train/Dev), it is straightforward to train the BOS/EOS labeling models based on our unit concatenation procedure. However, we may not always have the gold annotation of SUs and NSUs for the target domain. To take such cases into account, we also consider a setup where we only have the standard sentence segmentation dataset (WSJ Train/Dev) to train the BOS/EOS labeling models.
When using the sentence segmentation dataset (WSJ), we need to apply the unit concatenation procedure using only clean, edited SUs. Unfortunately, this can yield the following data priors which do not actually hold in a sentence identification dataset (EWT): (i) an SU (almost) always starts with a capitalization and ends with punctuation, (ii) the first word of the input is always BOS and the last word is always EOS, and (iii) BOS always directly follows EOS.
To address (i) and (ii), we propose a simple data augmentation technique to alleviate the discrepancy in the data priors. To address (iii), we propose an ensembling technique with the unidirectional (instead of bidirectional) models which are agnostic to this data prior.

Data Augmentation (+AUG)
To address (i), we conduct a unit-level data augmentation, i.e. we modify each unit based on the following rules with a small probability $p_{\mathrm{DA}}$:
• Convert all words in the unit to lower-case, upper-case, or title-case (e.g. "hello world", ...).
• Remove sentence-ending punctuation based on a regular-expression matcher (following ERSATZ, Wicks and Post, 2021).
Footnote 5: With parameter $p_{\mathrm{CC}} \in (0, 1]$, the probability mass function of the geometric distribution is $p(L = l) = (1 - p_{\mathrm{CC}})^{l-1} p_{\mathrm{CC}}$ where $l \in \{1, 2, 3, \ldots\}$. As $p_{\mathrm{CC}}$ decreases, the distribution gets more skewed towards larger $L$. With $p_{\mathrm{CC}} = 0$, we consider $p(L = \infty) = 1$.
Table 4: In (i) unit-level augmentation, we randomly change the casing or remove the last punctuation of each unit. In (ii) unit truncation, we randomly truncate the first and last units of the input (and regard them as NSUs).
Orig. (labels B E B): Joe went to school. After that he ...
Augmented: Joe went to school AFTER THAT HE ...
After the unit-level augmentation, we can apply the unit concatenation in the exact same manner. Finally, to address (ii), we randomly apply a unit truncation to the first and last units of the concatenated input. To be specific, we choose a random word in the first (last) unit and remove all words prior (posterior) to it with a small probability $p_{\mathrm{TR}}$. If the truncation is conducted, we regard the unit as an NSU and fix the gold BOS/EOS labels accordingly. See Table 4 for an illustration.
Based on this procedure, we can expect to alleviate the data priors (i) and (ii). For more details, we refer the readers to Appendix D.
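The two augmentation steps can be sketched as below. Several details are assumptions made for illustration only: the regular expression is a simplified stand-in for the ERSATZ punctuation matcher, exactly one randomly chosen rule is applied per augmented unit, and truncation of the first unit always removes at least one word. The helper names are hypothetical.

```python
import random
import re
from typing import List, Tuple

# Assumption: simplified pattern for sentence-ending punctuation.
_FINAL_PUNCT = re.compile(r"[.!?]+$")


def augment_unit(words: List[str], rng: random.Random, p_da: float = 0.3) -> List[str]:
    """With probability p_da, change the casing of the unit or strip its final punctuation."""
    if not words or rng.random() >= p_da:
        return list(words)
    rule = rng.choice(["lower", "upper", "title", "strip_punct"])
    if rule == "lower":
        return [w.lower() for w in words]
    if rule == "upper":
        return [w.upper() for w in words]
    if rule == "title":
        return [w[:1].upper() + w[1:].lower() for w in words]
    out = list(words)
    out[-1] = _FINAL_PUNCT.sub("", out[-1])
    return [w for w in out if w]  # drop the last token if it was pure punctuation


def truncate_first_unit(words: List[str], rng: random.Random,
                        p_tr: float = 0.1) -> Tuple[List[str], bool]:
    """With probability p_tr, drop all words before a randomly chosen word.
    The flag tells the caller to relabel the truncated unit as an NSU."""
    if len(words) < 2 or rng.random() >= p_tr:
        return list(words), False
    cut = rng.randrange(1, len(words))
    return list(words[cut:]), True


rng = random.Random(0)
print(augment_unit(["Joe", "went", "to", "school", "."], rng, p_da=1.0))
print(truncate_first_unit(["After", "that", "he", "left", "."], rng, p_tr=1.0))
```

Truncation of the last unit is symmetric (keep `words[:cut]` instead of `words[cut:]`).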

Unidirectional Model (+UNI)
Simply concatenating SUs (without NSUs) yields the data prior (iii), i.e. BOS always directly follows EOS. This prior can be easily captured by the bidirectional models $p_{\mathrm{BOS}}(w_i \mid W)$, $p_{\mathrm{EOS}}(w_i \mid W)$ conditioned on the whole input $W$, including our RoBERTa-based models. For instance, as shown in Figure 1, the model may predict EOS at the end of the first unit ($w_2$ = #) just because the next word ($w_3$ = This) is likely predicted as BOS.
To alleviate this issue, we propose to combine the predictions of the unidirectional models for BOS and EOS labeling. To be precise, let $W_{\le i} = (w_0, \ldots, w_i)$ and $W_{\ge i} = (w_i, \ldots, w_{N-1})$. Then, we can represent the unidirectional BOS model as $p^{\mathrm{Uni}}_{\mathrm{BOS}}(w_i \mid W_{\ge i})$ (looking at the context right-to-left) and the EOS model as $p^{\mathrm{Uni}}_{\mathrm{EOS}}(w_i \mid W_{\le i})$ (looking left-to-right). As illustrated in Figure 1, these models are agnostic to the data prior (iii). In practice, we can simply use different attention masks and share the encoder parameters (except the last classifier) for the unidirectional and bidirectional models.
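We can utilize these unidirectional models by taking a linear interpolation with the bidirectional models as follows:
$$p^{+\mathrm{Uni}}_{\mathrm{BOS}}(w_i \mid W) = (1 - \lambda) \, p_{\mathrm{BOS}}(w_i \mid W) + \lambda \, p^{\mathrm{Uni}}_{\mathrm{BOS}}(w_i \mid W_{\ge i})$$
$$p^{+\mathrm{Uni}}_{\mathrm{EOS}}(w_i \mid W) = (1 - \lambda) \, p_{\mathrm{EOS}}(w_i \mid W) + \lambda \, p^{\mathrm{Uni}}_{\mathrm{EOS}}(w_i \mid W_{\le i})$$
where $\lambda \in [0, 1]$ is the interpolation weight. Then, we can use $p^{+\mathrm{Uni}}_{\mathrm{BOS}}$ and $p^{+\mathrm{Uni}}_{\mathrm{EOS}}$ in place of $p_{\mathrm{BOS}}$ and $p_{\mathrm{EOS}}$ (respectively) to conduct sentence identification, as described in §4.2.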
Finally, we compare our proposed methods against sentence segmentation baselines which only utilize EOS labels. As for the baselines, we use the EOS labeling model developed in the same manner to segment the input text based on EOS. Note that we can optionally force the last word in the input to be EOS: in this case, the result will only contain SUs since all segments will end with EOS. By default, we do not force the last EOS: in this case, the segment after the last EOS (if it exists) is considered as an NSU.
As a default configuration, we use $p_{\mathrm{CC}} = 0.5$, $p_{\mathrm{DA}} = 0.3$, $p_{\mathrm{TR}} = 0.1$, and $\lambda = 0.5$ in our experiments. To ensure reproducibility, we report more details on the hyperparameters and model setup in Appendix D. For the precise procedure on how we convert between the word-, character-, and subword-level labels (for RoBERTa), we refer the readers to Appendix C.

Evaluation Setup
In the evaluation phase, we consider three ways of assembling the input texts on which we conduct sentence identification. Firstly, we can apply the same unit concatenation procedure as described in §6.1. To be specific, we use $p_{\mathrm{CC}} = 0.5$ (same as the training phase) and $p_{\mathrm{CC}} = 0$ (which concatenates the units up to the maximal length) to simulate both shorter and longer lengths of the input texts. However, this approach is relatively synthetic in the sense that we take the gold unit boundaries for granted. They are usually unavailable at inference time, so we should consider a more realistic setting for evaluating the methods without relying on the gold unit boundaries.
To this end, we propose to evaluate sentence identification as a postprocessing of sentence segmentation. To be specific, we first apply the state-of-the-art method ERSATZ (Wicks and Post, 2021) on the raw text of EWT and then apply sentence identification to each segmented text. Note that ERSATZ has high precision but still predicts false EOS which can fragment a gold SU: in such cases, we consider the fragmented SUs as NSUs and fix the labels accordingly (just as we did in unit truncation, cf. §6.1 and Table 4).
As for the evaluation metrics, we convert the predictions of our methods into word/character-level BIO labels (cf. Appendix C) and compute the F1 score for each label prediction. Then, we summarize the results as the macro average F1 and weighted average F1. We also compute the F1 score of the exact SU span extraction at the word/character-level. Finally, we run each experiment (from model training to testing) five times with different random seeds and report the average and standard deviation as the final results.
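The two kinds of metrics can be made concrete with a small sketch. It assumes that SUs are represented as half-open (start, end) index spans and that a predicted span counts as correct only if both boundaries match a gold span; `spans_to_bio` and `span_f1` are hypothetical helper names.

```python
from typing import Iterable, List, Tuple

Span = Tuple[int, int]  # half-open interval [start, end)


def spans_to_bio(num_tokens: int, spans: Iterable[Span]) -> List[str]:
    """Convert SU spans into BIO labels; tokens outside every span receive O."""
    labels = ["O"] * num_tokens
    for start, end in spans:
        labels[start] = "B"
        for k in range(start + 1, end):
            labels[k] = "I"
    return labels


def span_f1(gold: Iterable[Span], pred: Iterable[Span]) -> float:
    """F1 score of exact SU span extraction."""
    gold_set, pred_set = set(gold), set(pred)
    true_positives = len(gold_set & pred_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)


print(spans_to_bio(6, [(1, 4)]))                    # ['O', 'B', 'I', 'I', 'O', 'O']
print(span_f1([(0, 4), (6, 9)], [(0, 4), (5, 9)]))  # 0.5
```

The second call shows why exact span extraction is stricter than label-level F1: a span whose BOS is off by one word counts as entirely wrong.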

Results
Table 5 summarizes the word-level evaluation results. The results for the character-level evaluation show similar tendencies, so we put them in Appendix E. The F1 score for each BIO label prediction is also available in Appendix E.
Firstly, we take a look at the results when we have the in-domain sentence identification dataset (EWT Train/Dev) for model development. In this setup, we can verify that our proposed method (BOS&EOS) significantly outperforms the baselines (EOS-Only) in all metrics. For instance, our method achieves consistently high performance of 84∼89% F1 for the exact SU span extraction, both at the word- and character-level. This is a very promising result that demonstrates the effectiveness of our method when we can leverage the gold SUs and NSUs from the target domain.
Secondly, we focus on the results where we only utilize the standard sentence segmentation dataset (WSJ Train/Dev) for model development. In this setup, we also report the results of applying our data augmentation (+AUG) and unidirectional model (+UNI) techniques from §6.1. Due to the data discrepancy between WSJ and EWT, we find a natural drop in performance compared to the previous setup using in-domain EWT Train/Dev. However, we can verify that our techniques (+AUG, +UNI) generally help to alleviate this issue, and our proposed method performs on par with or slightly better than the EOS-only baselines when applying these techniques. It is especially worth noting the improvement in the exact SU span extraction task (reaching 64∼72% F1), where the advantage of our method is the most conspicuous and consistent in both word- and character-level evaluation. This improvement can also be explained by the higher performance in the B-label prediction with our method (Appendix E), which is a prerequisite for accurate SU span extraction.
Finally, we note that the EOS-only baseline without forcing the last EOS can be quite competitive with shorter inputs ($p_{\mathrm{CC}} = 0.5$ and postprocessing) but performs considerably worse when the input texts are longer ($p_{\mathrm{CC}} = 0$). This is because the baseline can only predict the last segment of the input as an NSU, which is less problematic when the input texts are shorter but becomes increasingly problematic with longer inputs (since most NSUs cannot be removed). In contrast, our proposed method performs much more robustly under various input lengths.
Through further experiments and analyses, we found that (i) the results are stable across different hyperparameter choices, (ii) predictions are reasonable especially when using the in-domain dataset (EWT Train/Dev) for model development, and (iii) our methods do not sacrifice performance on the formal/edited texts of the sentence segmentation dataset (WSJ Test). The detailed evidence can be found in Appendix F.

Conclusion
In this paper, we introduced a novel task of sentence identification, where we aim to identify SUs while excluding NSUs in a given text (§3). Through sentence identification, we can clearly distinguish the portions of the text that are appropriate (or not) for the prediction and evaluation of sophisticated linguistic analyses, such as dependency parsing, semantic role labeling, etc.
To conduct sentence identification, we proposed a simple yet effective method of combining the BOS and EOS labeling models to determine the SUs and NSUs (§4). To evaluate sentence identification, we designed an automatic, language-independent procedure to convert the UD corpora into sentence identification benchmarks (§5).
In our experiments, we developed the BOS/EOS labeling models by finetuning pretrained RoBERTa (§6). Based on the experimental results, we showed that our proposed method combining the BOS and EOS labels outperforms sentence segmentation baselines which only utilize EOS labels in all of the considered settings (§7). Overall, we expect sentence identification to be a fundamental framework for the preprocessing of noisy, informal, or non-standard texts in the real world.

Limitations
Firstly, our current experiments are limited to English and cover only five domains of web media texts in EWT. However, our task formulation (§3), method (§4), and evaluation framework (§5) are fully agnostic to the language and domain. Hence it is straightforward to conduct experiments in different languages or domains (as long as they are supported in UD). While we expect similar results with different languages/domains, we leave further investigation as future work.
Secondly, while our method performs reliably when the in-domain dataset is available, there is still considerable room for improvement when such resources are unavailable (e.g. when only using the standard sentence segmentation dataset). To make our method fully practical, we still need to improve the accuracy and robustness in such cross-domain scenarios. One potential approach is to refine the definitions of SU and NSU probabilities from §4.2 to make sentence identification more robust. For instance, we can incorporate span-level scores instead of only using word-level BOS/EOS probabilities to define the SU/NSU probabilities. We leave further improvement and extension of our approach as important future work.
Finally, our methods are currently evaluated on the (exact) SU span extraction task. Ideally, we should also evaluate the methods on downstream applications such as POS tagging, syntactic parsing, semantic role labeling, etc. However, we still expect that the (exact) SU span extraction will play a primary role in the evaluation, since accurate (say human-level) identification of SUs/NSUs will likely provide unprecedented benefits on a wide variety of NLP applications dealing with real-world texts. While we leave the precise analyses on downstream applications as future work, our contributions make the first foundational step towards expanding the capability of the long-established sentence segmentation task.

A Dynamic Programming Algorithm
To find the maximum value (and the argmax) of Eq. 4 from §4.2, we rely on a simple dynamic programming framework. To be specific, we consider the partial labeling of BOS and EOS up to $W_{\le k} = (w_0, \ldots, w_k)$, where $k \le N - 1$. Then, we aim to compute the maximum log probability of Eq. 4 based on the partial labeling, i.e. using $W_{\le k}$ in place of $W$.
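Since the labeling is partial, $W_{\le k}$ may end inside the SU (i.e. the last label is BOS) or outside the SU (i.e. the last label is EOS). Let $\log p_{\mathrm{IS}}(k+1)$ denote the maximum log probability when $W_{\le k}$ ends inside the SU and $\log p_{\mathrm{OS}}(k+1)$ the maximum log probability when $W_{\le k}$ ends outside the SU. Then, we can initialize $\log p_{\mathrm{IS}}(0) = \log 0 = -\infty$, $\log p_{\mathrm{OS}}(0) = \log 1 = 0$ (since we always start outside the SU) and iteratively update the two values as follows:
$$\begin{aligned}
\log p'_{\mathrm{IS}}(i) &= \max \bigl\{ \log p_{\mathrm{IS}}(i) + \log (1 - p_{\mathrm{BOS}}(w_i)), \; \log p_{\mathrm{OS}}(i) + \log p_{\mathrm{BOS}}(w_i) \bigr\} \\
\log p'_{\mathrm{OS}}(i) &= \log p_{\mathrm{OS}}(i) + \log (1 - p_{\mathrm{BOS}}(w_i)) \\
\log p_{\mathrm{IS}}(i+1) &= \log p'_{\mathrm{IS}}(i) + \log (1 - p_{\mathrm{EOS}}(w_i)) \\
\log p_{\mathrm{OS}}(i+1) &= \max \bigl\{ \log p'_{\mathrm{OS}}(i) + \log (1 - p_{\mathrm{EOS}}(w_i)), \; \log p'_{\mathrm{IS}}(i) + \log p_{\mathrm{EOS}}(w_i) \bigr\}
\end{aligned} \tag{5}$$
Note that we first update $p_{\mathrm{IS}}(i) \to p'_{\mathrm{IS}}(i)$ and $p_{\mathrm{OS}}(i) \to p'_{\mathrm{OS}}(i)$ based on the BOS probability $p_{\mathrm{BOS}}(w_i)$. Then, we update $p'_{\mathrm{IS}}(i) \to p_{\mathrm{IS}}(i+1)$ and $p'_{\mathrm{OS}}(i) \to p_{\mathrm{OS}}(i+1)$ based on the EOS probability $p_{\mathrm{EOS}}(w_i)$. The iterative procedure is illustrated in Figure 2.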
Finally, we can compute the log probability $\log p_{\mathrm{OS}}(N)$ (since we always end outside the SU), which corresponds to the maximum value of Eq. 4.
To obtain the argmax, we can simply incorporate backtracking during the iterative updates of Eq. 5. Through this dynamic programming framework, we can ensure that the restrictions from §4.2 are satisfied: namely, (i) the first label should be BOS, (ii) the last label should be EOS, and (iii) BOS and EOS labels need to appear alternately.
In practice, we can limit the candidates of BOS indices to the subset where $p_{\mathrm{BOS}}(w_i)$ is higher than a certain threshold $c$. This can be efficiently implemented by simply skipping the updates of $p'_{\mathrm{IS}}(i)$ and $p'_{\mathrm{OS}}(i)$, i.e. using $p'_{\mathrm{IS}}(i) = p_{\mathrm{IS}}(i)$ and $p'_{\mathrm{OS}}(i) = p_{\mathrm{OS}}(i)$, if $p_{\mathrm{BOS}}(w_i) < c$ (footnote 9). Likewise, we can limit the candidates of EOS indices by skipping the updates of $p_{\mathrm{IS}}(i+1)$ and $p_{\mathrm{OS}}(i+1)$ if $p_{\mathrm{EOS}}(w_i) < c$. Generally speaking, this leads to a more efficient algorithm: therefore, we use the candidate threshold of $c = 0.1$ for restricting both BOS and EOS indices throughout our experiments.
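The procedure described in this appendix can be sketched as a two-state Viterbi-style search. The code below is an illustrative reading of the description, not the reference implementation: `identify_sentences` is a hypothetical name, ties are broken in favor of not placing a label, and the threshold defaults to 0 (no pruning).

```python
import math
from typing import List, Sequence, Tuple


def _log(x: float) -> float:
    """Logarithm that maps 0 to -inf instead of raising an error."""
    return math.log(x) if x > 0.0 else -math.inf


def identify_sentences(p_bos: Sequence[float], p_eos: Sequence[float],
                       c: float = 0.0) -> Tuple[List[Tuple[int, int]], float]:
    """Return the most probable SU spans as half-open (start, end) pairs,
    together with the maximum log probability."""
    n = len(p_bos)
    log_is, log_os = -math.inf, 0.0  # we always start outside an SU
    opened = [False] * n  # best inside-state at i was reached by labeling w_i as BOS
    closed = [False] * n  # best outside-state at i was reached by labeling w_i as EOS
    for i in range(n):
        if p_bos[i] >= c:  # BOS step (skipped for pruned candidates)
            stay = log_is + _log(1.0 - p_bos[i])
            start = log_os + _log(p_bos[i])
            opened[i] = start > stay
            log_is = max(stay, start)
            log_os = log_os + _log(1.0 - p_bos[i])
        if p_eos[i] >= c:  # EOS step (skipped for pruned candidates)
            end = log_is + _log(p_eos[i])
            stay_out = log_os + _log(1.0 - p_eos[i])
            closed[i] = end > stay_out
            log_is = log_is + _log(1.0 - p_eos[i])
            log_os = max(stay_out, end)
    # Backtrack from the outside-state after the last word.
    spans: List[Tuple[int, int]] = []
    inside, end_index = False, 0
    for i in range(n - 1, -1, -1):
        if not inside and closed[i]:
            inside, end_index = True, i + 1
        if inside and opened[i]:
            inside = False
            spans.append((i, end_index))
    spans.reverse()
    return spans, log_os


# Toy input: "# Hi there . --" where only the middle three words form an SU.
p_bos = [0.1, 0.9, 0.05, 0.05, 0.1]
p_eos = [0.1, 0.05, 0.05, 0.9, 0.1]
print(identify_sentences(p_bos, p_eos)[0])  # [(1, 4)]
```

The search visits each word once and keeps only two running scores, so the cost is linear in the input length; pruning with a threshold only skips individual update steps.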

B SU and NSU Examples
In Table 6, we provide more examples of SUs and NSUs identified based on our procedure described in §5. As for the SUs, we can verify that EWT contains clean, formal SUs with appropriate capitalization and punctuation. We can also verify that EWT contains various types of informal SUs, e.g. SUs that lack capitalization/punctuation, use non-standard casing, end with emoticons, include spelling errors, concatenate consecutive SUs without a space, etc.

C Label Assignment and Conversion
In this section, we explain the precise procedure by which we (i) assign the gold character-level labels, (ii) convert the character-level labels to word/subword-level labels, and (iii) convert the subword-level labels to character/word-level labels. We limit our explanation to BIO labels, since it is straightforward to convert them to the combination of BOS and EOS labels (and vice versa).
Firstly, we can assign the gold character-level labels from the UD annotation by taking the character-level alignment, which determines the exact spans of SUs and NSUs. From the character-level labels, we can assign the word- or subword-level labels based on the following rule:
• If the word (or subword) contains a character with the B-label, assign it the B-label.
• Else if it contains a character with the I-label, assign the I-label.
• Otherwise assign the O-label.
For instance, this procedure is used to create the subword-level labels for training our BOS/EOS labeling models.
⁹ … $p_{\mathrm{BOS}}(w_i) = 0$ in Eq. 5.

Table 6, SUs:
Unfortunately, Mr. Lay will be in San Jose, CA participating in a conference, where he is a speaker, on June 14.
"In 1972, there was an enormous glut of pilots," Campenni says.
PS -There is a happy hour tonight at Scudeiros on Dallas Street (just west of the Met Garage) beginning around 5:00.
2) Your vet would not prescribe them if they didn't think it would be helpful.
BUT EVERYONE HAS THERE OWN WAY!!!!!!
The motel is very well maintained, and the managers are so accomodating, it's kind of like visiting family each year!;-)
where can I find the best tours to the Mekong Delta at reasonable prices?
it seems like its healthier too, but its prolly not.
I have wifi at my house, but thats just at my house...is there anyway i can buy some card to make the ipod itself have wifi?

Table 6, NSUs:
-->===}*{===<---Lisa_coverletter.doc << File: Lisa_coverletter.doc>>
Thur. Sept. 28 -Paris (Versailles or Fontainbleu -half day side trip)
9.3m -Number of US unemployed in April 2004.
Game 1: Monday, May 28 @ 2:00PM vs. Los Angeles SPARKS
Mixed Tempura.....................8.25 Shrimp or vegetable tempura & salad.
Infinity stereo, bucket seats, nerf bars, tool box, bed liner, camper tow package, 5 speed manual.
printing, printing, copies, printing, copies, printing,

To evaluate our methods, we need to convert the subword-level labels produced by our methods into the character-level labels, which can then be converted into the word-level labels (based on the previous procedure). To convert a subword-level label into a sequence of character-level labels, we apply the following rule (where $n$ denotes the number of characters in the subword):
• If the subword has the B-label, the character-level labels are one B-label followed by $n - 1$ I-labels.
• If the subword has the I-label, the character-level labels are $n$ I-labels.
• If the subword has the O-label, the character-level labels are $n$ O-labels.
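The two conversion rules of this appendix can be written down in a few lines of Python. This is a sketch under assumptions: labels are the plain strings "B", "I", and "O", token spans are half-open character offsets, and the handling of whitespace between tokens is left out. The function names are hypothetical.

```python
def chars_to_token_labels(char_labels, token_spans):
    """Character-level -> word/subword-level labels.

    token_spans holds half-open (start, end) character offsets per token.
    """
    token_labels = []
    for start, end in token_spans:
        covered = char_labels[start:end]
        if "B" in covered:
            token_labels.append("B")
        elif "I" in covered:
            token_labels.append("I")
        else:
            token_labels.append("O")
    return token_labels


def tokens_to_char_labels(token_labels, token_lengths):
    """Subword-level -> character-level labels (n = characters per subword)."""
    char_labels = []
    for label, n in zip(token_labels, token_lengths):
        if label == "B":
            # One B-label followed by n - 1 I-labels.
            char_labels.extend(["B"] + ["I"] * (n - 1))
        else:
            # n I-labels or n O-labels.
            char_labels.extend([label] * n)
    return char_labels
```

When every B-label falls on the first character of a token, applying the second function to the output of the first recovers the original character-level labels.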

D Details on the Model Setup
As discussed in §6.1, we finetune the pretrained RoBERTa$_{\mathrm{BASE}}$ publicly available on the Hugging Face model hub.¹⁰ On top of the encoder, we add a binary BOS/EOS classifier, which is a single-layer MLP with a hidden size of 768. We share the encoder parameters and use different classifiers for the BOS/EOS predictions. The BOS/EOS models are trained jointly by summing their losses.
¹⁰ https://huggingface.co/models
When we combine the unidirectional models (+UNI), we take the same approach and use different classifiers for the unidirectional/bidirectional models. Again, the encoder parameters are shared and all models are trained jointly.
As for the training data preparation, we apply the unit concatenation and data augmentation (+AUG) on the fly, i.e. we see different concatenation and augmentation of the units in each iteration.The same procedure is applied on the validation set.
During data augmentation, we remove the last sentence-ending punctuation based on the following regular expression, similar to the candidate boundary detector in ERSATZ (Wicks and Post, 2021):
• (.* P_e P*)
where $P$ denotes the set of punctuations and $P_e \subset P$ denotes the sentence-ending punctuations.
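As an illustration of the punctuation removal, the following Python sketch applies a pattern of this shape to a single unit. Several details are assumptions rather than facts from the paper: $P$ is taken to be ASCII punctuation (`string.punctuation`), $P_e$ is taken to be `.`, `!`, and `?`, the pattern must match the whole unit, and only the single $P_e$ character is dropped while trailing punctuation is kept. The function name is hypothetical.

```python
import re
import string

SENTENCE_ENDING = ".!?"  # assumed P_e
PUNCT_CLASS = "[" + re.escape(string.punctuation) + "]"  # assumed P

# (.*)  P_e  (P*): the greedy prefix makes P_e the last sentence-ending
# punctuation that is followed only by punctuation.
PATTERN = re.compile(
    r"(.*)[" + re.escape(SENTENCE_ENDING) + r"](" + PUNCT_CLASS + r"*)",
    re.DOTALL,
)


def remove_last_ending_punct(unit):
    """Drop the last sentence-ending punctuation mark of a unit, if any."""
    match = PATTERN.fullmatch(unit)
    if match is None:
        return unit  # no sentence-ending punctuation at the end of the unit
    return match.group(1) + match.group(2)
```

For example, a unit ending in a period followed by a closing quotation mark loses the period but keeps the quotation mark.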
Finally, all models are implemented in PyTorch and trained on a single Tesla V100-SXM2-32GB GPU. We use a batch size of 8, accumulate the gradients for 32 batches, and apply gradient clipping at 1.0 before updating the model weights. As for the optimizer, we use Adam with the initial learning rate of 0.0001 and exponentially decay the learning rate by $\gamma = 0.95$ after each epoch. We check the validation loss every 200 batches and stop the training early if there is no improvement for 5 consecutive evaluations.
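The schedule described above can be made concrete with a small framework-free Python sketch. It only mirrors the stated numbers (batch size 8, 32 accumulation steps, initial learning rate 0.0001, decay 0.95 per epoch, patience of 5 evaluations); the names are hypothetical, and "no improvement" is assumed to mean that the validation loss did not drop below the best value seen so far.

```python
# Examples contributing to each weight update: 8 per batch x 32 accumulated batches.
EFFECTIVE_BATCH_SIZE = 8 * 32


def learning_rate(epoch, base_lr=1e-4, gamma=0.95):
    """Learning rate used during the given (0-indexed) epoch."""
    return base_lr * gamma ** epoch


class EarlyStopper:
    """Signals a stop after `patience` consecutive evaluations without improvement."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("inf")
        self.bad_evals = 0

    def should_stop(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience
```

In a training loop, `should_stop` would be called once per validation check, i.e. every 200 batches in the setup above.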

E The Full Experimental Results
In this section, we report the full results of our experiments which did not fit in §7. Table 7 shows the word-level F1 scores for each B-, I-, and O-label prediction. Table 8 shows the overall results for the character-level evaluation.
Generally speaking, we can confirm the same results as observed in §7. Firstly, our proposed method significantly outperforms the baselines when we use the EWT Train/Dev dataset for model development. Secondly, our method performs slightly better than (or at least on par with) the baselines when developed on the WSJ Train/Dev dataset. Finally, the baseline without forcing the last EOS is competitive with shorter inputs ($p_{\mathrm{CC}} = 0.5$ and postprocessing) but performs considerably worse when the input texts are longer ($p_{\mathrm{CC}} = 0$).

F Further Experiments and Analyses
In this section, we provide further experiments and analyses to complement our study. To be specific, we provide discussions on the effect of the choice of hyperparameters (F.1), qualitative analyses based on example model outputs (F.2), and evaluation of sentence identification based on the sentence segmentation dataset (F.3).

F.1 Effect of Hyperparameters
As a default configuration, we used $p_{\mathrm{DA}} = 0.3$ and $p_{\mathrm{TR}} = 0.1$ for the data augmentation (+AUG) and $\lambda = 0.5$ for the unidirectional model ensembling (+UNI). To examine the effect of the choice of these hyperparameters, we conducted further experiments by changing these default hyperparameters. Note that all evaluation results in this subsection are based on BOS&EOS (+UNI +AUG) developed on WSJ Train/Dev.
Firstly, we focus on the data augmentation and report the results of our method trained with different sets of $p_{\mathrm{DA}}$ and $p_{\mathrm{TR}}$ (with $\lambda$ fixed at 0.5). Since increasing $p_{\mathrm{DA}}$ leads to higher recall (and lower precision) of SU extraction and increasing $p_{\mathrm{TR}}$ leads to higher precision (and lower recall), we used a fixed ratio of $p_{\mathrm{DA}} : p_{\mathrm{TR}} = 3 : 1$, which seemed to make a good trade-off. As shown in Table 9, the results are generally stable with the different choices of the hyperparameters. However, more data augmentation (with larger values of $p_{\mathrm{DA}}$ and $p_{\mathrm{TR}}$) tends to slightly improve the performance, especially for the exact SU span extraction.
Secondly, we focus on the unidirectional model ensembling and report the results of changing the linear interpolation rate $\lambda \in [0, 1]$, where $\lambda = 0$ is equivalent to using only the bidirectional models and $\lambda = 1$ only the unidirectional models. We fix $p_{\mathrm{DA}} = 0.3$ and $p_{\mathrm{TR}} = 0.1$ and only change $\lambda$ at the inference time without retraining the unidirectional or bidirectional models. As shown in Figure 3, we found that unidirectional and bidirectional models generally have complementary benefits, and choosing the intermediate value of $\lambda$ leads to the best performance. The results also indicate that we may be able to obtain further improvement by tuning $\lambda$ on the validation set, although we simply fixed $\lambda = 0.5$ throughout our experiments.
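For concreteness, the role of $\lambda$ can be sketched in Python as follows. The sketch assumes that the interpolation is applied directly to per-word probabilities; whether the ensembling operates on probabilities or on another quantity (e.g. log probabilities) is not stated in this appendix, and the function name is hypothetical.

```python
def interpolate(p_bidirectional, p_unidirectional, lam=0.5):
    """Linearly interpolate two per-word probability sequences.

    lam = 0 uses only the bidirectional model, lam = 1 only the unidirectional one.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [
        (1.0 - lam) * p_bi + lam * p_uni
        for p_bi, p_uni in zip(p_bidirectional, p_unidirectional)
    ]
```

Since only the combination weight changes, different values of `lam` can be compared at inference time from the same pair of model outputs.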

F.2 Qualitative Analyses
In Tables 10 and 11, we show the actual predictions made by our proposed method developed on EWT Train/Dev and WSJ Train/Dev. For the latter, we applied +UNI and +AUG with the default hyperparameters.
In the first example (Table 10), we can verify that both models identify the correct SU span while removing the non-sentential header as the NSU.This is a relatively easy example, since the start of the SU is capitalized and less ambiguous.
In the second example (Table 11), we can observe that our method using in-domain data (EWT Train/Dev) extracts the correct SU span, while our method developed on out-of-domain data (WSJ Train/Dev) incorrectly excludes a part of an SU. This seems to be a relatively difficult example, since the start of the SU is not capitalized and more ambiguous. It is worth noting that such SUs can be reliably extracted when we can leverage the in-domain annotation of gold SUs and NSUs.

F.3 Evaluation on the Sentence Segmentation Dataset
Finally, we report the results of sentence identification on the standard sentence segmentation dataset (WSJ Test).
Table 8: Overall Results (Character-Level). We report the macro/weighted average F1 of the BIO labeling task and the F1 score of the exact SU span extraction task.
In Table 12, we summarize the WSJ dataset statistics. Note that WSJ only contains SUs and does not contain any NSUs (O-labels). However, we can still evaluate the performance using the same metrics, i.e. the macro/weighted average F1 of the BIO labeling task and the F1 of the exact SU span extraction task.¹¹
Table 13 summarizes the word-level evaluation results. Since we are evaluating on WSJ Test, the performance is naturally better when the models are trained on WSJ Train/Dev rather than EWT Train/Dev (which is now out-of-domain).
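To make the exact SU span extraction metric concrete, the sketch below computes an F1 score in which a predicted SU counts as correct only if both of its boundaries match a gold SU. This is the usual definition of exact-match span F1 and is assumed here; how spans are aggregated over the corpus is not specified in this appendix, and the function name is hypothetical.

```python
def exact_span_f1(gold_spans, predicted_spans):
    """F1 over exactly matching (start, end) spans."""
    gold, predicted = set(gold_spans), set(predicted_spans)
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0  # also covers empty gold or empty predictions
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Under this definition, a prediction that gets one boundary of an SU wrong receives no partial credit for that SU, which is why the metric remains informative on a dataset that contains no NSUs.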
When the models are trained on EWT, we found that the baseline (EOS-Only) forcing the last EOS performs the best. This is natural, since this baseline better reflects the nature of the sentence segmentation dataset where all units are SUs. However, our method (BOS&EOS) is still comparable to this baseline and sacrifices little or no performance on such datasets.
When the models are trained on WSJ, we found that our method without +UNI or +AUG performs the best. This is most likely because we can leverage the knowledge of BOS to predict EOS. When we apply the data augmentation (+AUG) and unidirectional model ensembling (+UNI), we observe a slight decrease in performance compared to our vanilla method. However, the results are still comparable and even outperform the baselines in some metrics (e.g. the exact SU span extraction task).
¹¹ Since the O-label does not exist, we report the macro average F1 as the average F1 scores of the B-label and I-label predictions.
Overall, we can conclude that our methods do not sacrifice the performance on the clean, edited texts of the sentence segmentation dataset.
Table 10: Example Outputs (Both Correct). We show the predictions made by our proposed method (BOS&EOS) developed on EWT Train/Dev (top) or WSJ Train/Dev (bottom). We can verify that both methods identify the correct SU span while removing the non-sentential header as the NSU.

Developed on EWT (predicted labels: B, E): with my breakfast I like bacon and sausage when I having a big breakfast like a grand slam with pancakes and the works.
Developed on WSJ (predicted labels: B, E): with my breakfast I like bacon and sausage when I having a big breakfast like a grand slam with pancakes and the works.
and SU boundary indices $B = (b_0, b_1, \ldots, b_M)$ where $b_0 = 0$, $b_M = N$, and $\bigoplus_{i=1}^{M} W[b_{i-1} : b_i] = W$ (i.e. the concatenation of all SUs recovers the input text).
Figure 1: Illustration of the bidirectional EOS model (left) and the unidirectional EOS model (right). Example input: "Joe went to school AFTER THAT HE ..."

Figure 2: Illustration of the dynamic programming procedure.
(from EWT): Thank you. -TEXT.htm << File: TEXT.htm >> I was thinking of converting it to a hover vehicle. I might just sell the car and get you to drive me around all winter.
Sentence Segmentation (predicted labels: E, E): Thank you. -TEXT.htm << File: TEXT.htm >> I was thinking of converting it to a hover vehicle. I might just sell the car and get you to drive me around all winter.

Table 2: Examples of gold NSUs in the English Web Treebank (EWT) identified based on our procedure. Each line corresponds to one example of NSU.

Table 3: EWT dataset statistics.

… Table 2 (and more in Appendix B) identified based on our procedure. As shown by the results, our procedure can identify various NSUs including nonlinguistic markers, timestamps, lists, contact information, etc. We can also see that noun/prepositional phrases are classified as NSUs based on our criteria. By excluding such NSUs and identifying SUs, we can clearly separate the portions of the text that are worth sophisticated linguistic analyses, e.g. based on dependency parsing or manual inspection.

Table 4: Illustration of our data augmentation technique.

Table 5: Overall Results (Word-Level). We report the macro/weighted average F1 of the BIO labeling task and the F1 score of the exact SU span extraction task. Details of our experimental setup are discussed in §6.

Table 6: Examples of gold SUs and NSUs in the English Web Treebank (EWT) identified based on our procedure (§5). Each line corresponds to one example of SU or NSU.

Table 11: Example Output with One Incorrect Case. We show the predictions made by our proposed method (BOS&EOS) developed on EWT Train/Dev (top) or WSJ Train/Dev (bottom). We can verify that the former extracts the correct SU span, while the latter incorrectly excludes the first prepositional phrase as an NSU.