Revisiting Shallow Discourse Parsing in the PDTB-3: Handling Intra-sentential Implicits

In the PDTB-3, several thousand implicit discourse relations were newly annotated within individual sentences, adding to the over 15,000 implicit relations annotated across adjacent sentences in the PDTB-2. Since the position of the arguments of these intra-sentential implicits is no longer as well-defined as with inter-sentential implicits, a discourse parser must identify both their location and their sense. That is the focus of the current work. This paper provides a comprehensive analysis of our results, showing model performance under different scenarios, noting limitations, and pointing to future directions.


Introduction
Discourse parsing is the task of identifying and categorizing discourse relations between discourse segments in a given text. The task is considered important for downstream tasks such as question answering (Jansen et al., 2014), machine translation (Li et al., 2014), and text summarization (Cohan et al., 2018). There are various approaches to discourse parsing, corresponding to different views of (1) what constitutes the segments of discourse, (2) what structures can be built from such segments, and (3) what semantic and/or rhetorical relations can hold between such segments (Xue et al., 2015; Zeldes et al., 2019).
In the Penn Discourse Treebank (PDTB; Prasad et al., 2008), all discourse relations have two arguments, called Arg1 and Arg2. Discourse relations are termed explicit, if the evidence for the relation is an explicit discourse connective (word or phrase). For implicit discourse relations, evidence is in the form of argument adjacency (with or without intervening punctuation), though annotators were asked to record one or more discourse connectives that, if present, would explicitly signal the sense(s) they inferred to hold between the arguments. Where annotators felt that the relation was already signalled by an alternative (non-connective) expression, the expression was annotated as evidence for what was called an AltLex relation (Prasad et al., 2010).
The first major release of the PDTB was the PDTB-2 (Prasad et al., 2008), whose guidelines limited annotation to (a) Explicit relations lexicalized by discourse connectives, and (b) Implicit and AltLex relations between paragraph-internal adjacent sentences and between complete clauses within sentences separated by colons or semi-colons. Since there were only ∼530 intra-sentential implicit relations among the ∼15,500 implicit relations annotated in the PDTB-2, they were ignored in work on discourse parsing (Lin et al., 2014; Wang and Lan, 2015; Xue et al., 2015, 2016), which took implicit relations to hold only between adjacent sentences. The situation changed with the release of the PDTB-3 (Webber et al., 2019). Among the ∼5.6K sentence-internal implicit relations annotated in the PDTB-3 are relations between VPs or clauses conjoined implicitly by punctuation (Ex. 1), between a free adjunct or free to-infinitive and its matrix clause (Ex. 2), and between a marked syntactic construction and its matrix clause. There are also implicit relations co-occurring with explicit relations (Webber et al., 2019), as noted in Section 3.1.
(1) Father McKenna moves through the house praying in Latin, (Implicit=and) urging the demon to split. [wsj_0413]

[...] to the left of the boundary as Arg1 and material to the right as Arg2. Secondly, Arg1 and Arg2 can appear in either order: Arg1 before Arg2, as in Ex. 1-2, or Arg2 before Arg1, as in Ex. 3. Parsing implicit intra-sentential relations therefore requires both locating and labelling their arguments, as well as identifying the sense(s) in which they are related.

(3) (Implicit=if it is) To slow the rise in total spending, it will be necessary to reduce per-capita use of services. [wsj_0314]

This work takes up some of the challenges of parsing implicit intra-sentential discourse relations. Overall, it contributes: (1) a set of BERT-based models used as a pipeline for recognizing intra-sentential implicit discourse relations and classifying their senses; (2) experimental evidence that these BERT-based models outperform comparable LSTM-based models on the relation recognition task; (3) evidence that the use of parse tree features can improve model performance, as was earlier found useful in simply recognizing whether a sentence contains at least one implicit intra-sentential relation (Liang et al., 2020).

Related Work
The focus of the current work is parsing implicit intra-sentential discourse relations in the framework of the PDTB-3. As most of the implicit relations in the PDTB-2 were inter-sentential (∼95% of its 15.5K implicit relations), its intra-sentential implicits were ignored in parser development. Nearly all recent work on recognizing inter-sentential implicits in the PDTB-2 has used neural architectures: multi-level attention (Liu and Li, 2016), and multiple text representations at the character, subword, word, sentence, and sentence-pair levels to more fully capture the text (Bai and Zhao, 2018). Dai and Huang (2018) introduced a paragraph-level neural architecture with a conditional random field (CRF; Lafferty et al., 2001) layer that models inter-dependencies of discourse units and predicts a sequence of discourse relations in a paragraph. Varia et al. (2019) introduced a CNN-based approach that distills knowledge from word pairs for discourse relation recognition by jointly learning implicit and explicit relations. Shi and Demberg (2019) found that BERT-based models, which are trained with a next-sentence-prediction objective, benefit implicit inter-sentential discourse relation classification. Here we assess whether they also benefit the classification of intra-sentential implicit relations.
Looking at implicit relations in the PDTB-3, Prasad et al. (2017) consider the difficulty in extending implicit relations to relations that cross paragraph boundaries. Kurfalı and Östling (2019) examine whether implicit relation annotation in the PDTB-3 can be used as a basis for learning to classify implicit relations in languages that lack discourse annotation. Kim et al. (2020) explored whether the PDTB-3 could be used to learn finegrained (Level-2) sense classification in general, while Liang et al. (2020) looked at whether separating inter-sentential implicits from intra-sentential implicits could improve their sense classification. They also took a first step towards recognizing what sentences contained intra-sentential implicit relations, finding this benefitted from the use of linearized parse tree features.
Outside the PDTB-3 framework, intra-sentential discourse relations are handled by (1) identifying discourse units (DUs), (2) attaching them to one another, and (3) associating the attachment with a coherence relation (Muller et al., 2012). One can therefore ask why we did not simply adopt this framework in the PDTB-3 and exploit the relatively good performance of systems in the DISRPT shared task on sentence-level discourse unit segmentation (Zeldes et al., 2019). There are two main reasons. First, DISRPT (and the approaches to discourse structure it covers) assumes that discourse segments partition a sentence into non-overlapping units. This is not the case with the PDTB, where the presence of overlapping segments (both within and across sentences) has been well documented (Lee et al., 2006). Second, discourse segments in these approaches are taken to correspond to syntactic units, which leads to both over-segmentation and under-segmentation relative to the PDTB-3. Of course, there are "work-arounds" for over-segmentation, such as RST's use of a SAME-SEGMENT relation (Mann and Thompson, 1988), and under-segmentation can be addressed through additional segmentation. However, we decided that starting from scratch would allow us to clearly identify the problems of parsing intra-sentential implicits, at which point we could consider what to adopt from work done on the DISRPT shared task.

Methodology
Given an input sentence S represented as a sequence of tokens s_1 · · · s_n, our aim is to identify the spans of Arg1 and Arg2, if there exists an implicit discourse relation in that sentence, and then to predict the corresponding sense of the relation. We treat the identification of argument spans as a sequence tagging problem, and the prediction of senses as a classification task. Thus, given S, our aim is to output both a tag sequence Y of length n and a sense label c for the identified relation. The generated tag sequence y_1 · · · y_n contains token-level labels y_j ∈ {B-Arg1, B-Arg2, I-Arg1, I-Arg2, O}, indicating whether the token begins or continues Arg1 or Arg2, or belongs to neither (Other). We adopt the BIO format (Ramshaw and Marcus, 1999) since arguments (1) can span multiple tokens, (2) can occur in either order, (3) need not be adjacent, and (4) do not overlap. (Future work will address two additional properties of intra-sentential implicits: (1) as shown in Sec. 3.1, a sentence can contain more than one such relation, and (2) even though most arguments are continuous spans, 264 intra-sentential implicit relations (4.2%) have discontinuous spans.) This section describes the two parts of our approach. Section 3.1 describes the creation of two datasets based on the PDTB-3: D_1 = {(S^(i), P^(i), Y^(i)) | i = 1 . . N} and D_2 = {(A_1^(j), A_2^(j), P^(j), c^(j)) | j = 1 . . M}, consisting of N and M input-output pairs respectively, where S is the input sentence, P is its parse tree, A_1 and A_2 are Arg1 and Arg2 of the intra-sentential relation, Y is the output label sequence, and c is the sense label. Sections 3.2-3.3 describe the models that recognize the argument spans and classify the relations.
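As an illustration, the span-to-tag conversion just described can be sketched in a few lines. The function and the token spans below are our own illustrative construction (not the paper's code), using Ex. 1 from the Introduction and assuming the comma and final period fall outside both arguments:

```python
def spans_to_bio(n_tokens, arg1_span, arg2_span):
    """Convert (start, end) token spans (end exclusive) for Arg1 and Arg2
    into a BIO label sequence; tokens outside both spans are labelled O."""
    labels = ["O"] * n_tokens
    for name, (start, end) in (("Arg1", arg1_span), ("Arg2", arg2_span)):
        labels[start] = "B-" + name
        for i in range(start + 1, end):
            labels[i] = "I-" + name
    return labels

# Ex. 1, tokenized; Arg1 = tokens 0-8, Arg2 = tokens 10-14 (assumed spans)
tokens = ("Father McKenna moves through the house praying in Latin , "
          "urging the demon to split .").split()
labels = spans_to_bio(len(tokens), arg1_span=(0, 9), arg2_span=(10, 15))
```

Because the scheme marks span beginnings explicitly, the same label set covers both Arg1-Arg2 and Arg2-Arg1 orders without any extra machinery.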

Dataset Generation
As we have two tasks, we built two datasets. Dataset D 1 is used to train our argument identification models. It simply comprises individual sentences from the PDTB-3 and a sequence of labels of these sentences. Some models also contain parse tree features (Marcus et al., 1993). To generate the sequence of labels Y , we take annotations of intra-sentential implicit relations from the PDTB-3.
Arg1 tokens in the sentence are labelled Arg1, and Arg2 tokens are labelled Arg2. Tokens labelled O belong to neither Arg1 nor Arg2. If a sentence has no intra-sentential implicit relation, all of its tokens are labelled O. In BIO format, these labels become B-Arg1, B-Arg2, I-Arg1, I-Arg2, and O. The dataset comprises the 46,430 sentences in the PDTB-3, with 24,369 intra-sentential relations, of which 6,234 are implicit. Table 1 shows that a single sentence can have zero, one, or more intra-sentential implicit relations, with over 99% of sentences having no more than one. So as not to lose any training data, the 321 sentences with two relations are duplicated, with each duplicate containing one of the relations. So while we do not currently try to learn multiple implicit relations within a sentence, this approach means we do not prejudice which relation is learned.
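The duplication step can be sketched as follows; the data layout and function name are illustrative, not the paper's actual preprocessing code:

```python
def expand_multi_relation(sentences):
    """Each item is (tokens, [relation, ...]).  Emit one training instance
    per relation, duplicating the sentence, so that no relation is lost.
    A sentence with no intra-sentential implicit yields one all-O instance
    (relation = None)."""
    instances = []
    for tokens, relations in sentences:
        if not relations:
            instances.append((tokens, None))
        for rel in relations:
            instances.append((tokens, rel))
    return instances

demo = [
    (["a", "b"], ["rel1", "rel2"]),  # a sentence with two relations
    (["c"], []),                     # a sentence with none
]
instances = expand_multi_relation(demo)
```

Each duplicate carries exactly one relation's labels, so training never has to arbitrate between two gold tag sequences for the same sentence.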

[Table 1: Number of intra-sentential implicit relations per sentence]
Another way of not losing data is to treat intra-sentential AltLex relations as intra-sentential implicits, since the two differ only in that an AltLex relation signals its sense through its lexicalization (Liang et al., 2020). For instance, free adjuncts are generally Arg2 of an intra-sentential implicit. However, free adjuncts headed by "avoiding", "contributing to", "resulting in", etc. are labelled AltLex relations because the head uniquely signals a RESULT sense. Structurally, however, such a relation is still an intra-sentential implicit.
Finally, the dataset does not include implicit relations that are linked to an explicit relation. Such linking is used to convey that the arguments are semantically related in a way that cannot be attributed to the explicit discourse connective alone (Webber et al., 2019). A dedicated model to recognize implicits "linked" to explicit relations is included in work by Liang et al. (2020).
[Figure 1: Argument identification architecture — the input sentence passes through a Bi-LSTM or BERT encoder, and a CRF layer produces the label sequence]

Dataset D_2 is used to train the sense classifier. It only contains data on sentences with intra-sentential implicits, and is thus smaller than D_1. Each entry in D_2 includes Arg1, Arg2, the parse tree for the sentence in which they lie, and a sense label for the Arg1-Arg2 pair. As with D_1, we also include AltLex relations. The current effort uses Level-2 sense labels to avoid data sparsity, while still providing a more meaningful sense than the 4 coarse labels at Level-1. The distribution of Level-2 sense labels is shown in Table 7 in Appendix A.1.

Argument Identification
The architecture for our argument identification model is shown in Figure 1. The input sentence first goes through the word embedding layer and then passes through either a BiLSTM (Hochreiter and Schmidhuber, 1997) or BERT module. Then the learned representation over the input sentence is fed into a CRF layer to generate a sequence of labels Y .
Baseline model The baseline model uses pre-trained GloVe (Pennington et al., 2014) vectors with a BiLSTM and no additional parse tree features. For input sentence S = {s_1, . . . , s_n}, where s_i denotes the i-th token in S, the word vector e_i is obtained from the word embedding module. A contextualized token-level encoding h_i is then obtained via the BiLSTM module:

h_i = [→h_i ; ←h_i]

where →h_i and ←h_i are the hidden states of the forward and backward LSTMs at time step i, and ; denotes concatenation.
The resulting contextual word representations are then fed to the CRF layer to predict the label sequence Y.

BERT-based models Shi and Demberg (2019) observed that BERT-based models can benefit the task of classifying implicit inter-sentential discourse relations. Here we ask whether they can also help in recognizing the arguments of intra-sentential implicit relations, by creating variants of the baseline model in which some components are replaced by BERT-based models.
The first variant replaces the pre-trained GloVe word vectors with a pre-trained BERT model for word embedding initialization. We then construct a second variant on top of this, which also uses parse tree features. (These come from gold Penn TreeBank parse trees (Marcus et al., 1993), since we already know from Liang et al. (2020) that performance drops when automatically produced parse trees are used.) The parse trees are first linearized and then fed to a separate BiLSTM module. The learned parse tree representation is concatenated with the learned representation of the input sentence into a single vector. This vector, containing both lexical and syntactic information about the input, is then fed to the CRF layer for output prediction.
These model variants use a pre-trained BERT model. We also implemented models that fine-tune BERT on our task. One variant uses the vanilla BERT model, replacing the BiLSTM module, and another variant has the same architecture but also uses parse tree features of the input sentences.

Sense Classification
The sense classifier uses a BERT model whose input is the pair of arguments, Arg1 and Arg2, and whose output is a Level-2 discourse relation sense. The model architecture, following Shi and Demberg (2019), is illustrated in Figure 2.

Training and Inference
Given the training set for argument identification with labelled sequences, D_1 = {(S^(i), P^(i), Y^(i)) | i = 1 . . N}, we maximize the conditional log-likelihood for the sequence tagging objective:

max_w Σ_{i=1}^{N} log p(Y^(i) | S^(i); w)

where w denotes the model's parameters, including the weights of the LSTM/BERT module and the transition weights of the CRF layer. The loss function for the Y labels is the negative log-likelihood of the gold sequence Y^(i) = {y_1, . . . , y_n}:

L_tag = − log p(Y^(i) | S^(i); w)

For sense classification, which predicts the label c, the loss function is the cross entropy:

L_sense = − Σ_{k=1}^{C} c_k log ĉ_k

where C denotes the number of classes, c_k is 1 for the gold sense and 0 otherwise, and ĉ_k is the predicted probability of sense k.
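A minimal sketch of the two losses follows. The brute-force enumeration of tag sequences stands in for the CRF's forward algorithm and is adequate only for tiny toy inputs; all names and scores here are illustrative, not the paper's implementation:

```python
import math
from itertools import product

def crf_nll(emissions, transitions, tags):
    """Negative log-likelihood of one tag sequence under a linear-chain CRF.
    emissions[i][t] scores tag t at position i; transitions[a][b] scores
    moving from tag a to tag b.  The partition function is brute-forced."""
    n, num_tags = len(emissions), len(emissions[0])

    def score(path):
        s = sum(emissions[i][path[i]] for i in range(n))
        return s + sum(transitions[path[i]][path[i + 1]] for i in range(n - 1))

    log_z = math.log(sum(math.exp(score(p))
                         for p in product(range(num_tags), repeat=n)))
    return log_z - score(tags)  # = -log p(tags | sentence)

def cross_entropy(probs, gold_index):
    """Cross-entropy for one example: -log of the probability the
    classifier assigns to the gold sense."""
    return -math.log(probs[gold_index])

# Toy 2-token, 2-tag example with zero transition scores
nll = crf_nll([[1.0, 0.0], [0.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]], (0, 1))
```

Since the partition function sums over all paths including the gold one, the CRF negative log-likelihood is always non-negative.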
At test time, inference for the labels of a sentence S involves applying the Viterbi algorithm to the CRF module to find the sequence with maximum likelihood:

Ŷ = argmax_Y p(Y | S; w)
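Viterbi decoding over a linear-chain CRF can be sketched as below, with additive log-scores; this toy version omits the BIO transition constraints that the paper's AllenNLP-based decoder enforces:

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring tag sequence for a linear-chain CRF,
    given additive log-scores emissions[i][t] and transitions[a][b]."""
    n, num_tags = len(emissions), len(emissions[0])
    best = [list(emissions[0])]  # best[i][t]: best score of a path ending in t
    back = []                    # backpointers for path recovery
    for i in range(1, n):
        row, ptr = [], []
        for t in range(num_tags):
            scores = [best[-1][a] + transitions[a][t] for a in range(num_tags)]
            a_max = max(range(num_tags), key=scores.__getitem__)
            row.append(scores[a_max] + emissions[i][t])
            ptr.append(a_max)
        best.append(row)
        back.append(ptr)
    t = max(range(num_tags), key=best[-1].__getitem__)
    path = [t]
    for ptr in reversed(back):   # walk backpointers to recover the path
        t = ptr[t]
        path.append(t)
    return path[::-1]

# Toy example: with zero transitions, decoding follows per-position argmax
path = viterbi([[2.0, 0.0], [0.0, 2.0], [2.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]])
```

Constrained decoding would simply set illegal transition scores (e.g. O → I-Arg1) to −∞ before running the same recursion.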

Experiments
For our experiments with LSTM-based models, we set the hidden dimension to 256, the word embedding dimension to 100, and the vocabulary size to 50K. The word embeddings are initialized using either pre-trained GloVe vectors (6B tokens, uncased) or pre-trained base-uncased BERT. For experiments with BERT-based models, we use the same configuration with base-uncased BERT from Devlin et al. (2019). All of our training used the Adam optimizer (Kingma and Ba, 2015). LSTM-based models used a learning rate of 1e-3 and BERT-based models a learning rate of 5e-5. We also use gradient clipping with a maximum gradient norm of 1, and we do not use any other form of regularization. For sense classification, gradient clipping uses a maximum gradient norm of 0.5. We carry out assessment on the datasets described in Section 3.1 in two ways: (1) using a random split of each dataset into training (60%), development (20%), and test (20%) subsets, and (2) accepting the argument of Shi and Demberg (2017) that there is too much variation across the corpus for a single random split to produce representative results, and that N-fold cross-validation delivers more reliable and predictive results. In this work, we perform 10-fold cross-validation. We use loss on the development set for early stopping. All of our models were trained on a single Tesla P100 GPU with a batch size of 32. Our implementation uses PyTorch (Paszke et al., 2019). We used the BERT implementation from the Transformers library (Wolf et al., 2020) and the CRF implementation from the AllenNLP library (Gardner et al., 2018), which supports constrained decoding for the BIO scheme.
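A sketch of the 10-fold index split; the authors do not specify their fold construction, so this is one standard way to do it, with illustrative names:

```python
import random

def ten_fold_splits(n_items, seed=0):
    """Yield (train_indices, test_indices) for 10-fold cross-validation
    over a dataset of n_items examples."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)          # fixed seed for reproducibility
    folds = [idx[k::10] for k in range(10)]   # ten near-equal folds
    for k in range(10):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, folds[k]

splits = list(ten_fold_splits(100))
```

Every example appears in exactly one test fold, so fold-averaged metrics use each data point once.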

Results
Argument Identification The best results on the sequence labelling objective across all model variants come from the fine-tuned BERT-based model with additional parse tree features. In general, BERT-based models outperform LSTM-based models, and models that use parse tree features outperform those that do not. The standard automatic evaluation metrics for discourse argument recognition are Precision, Recall and F_1, computed on predicted arguments for relations that match the gold annotations. We follow Xue et al. (2015) in counting an argument as correctly recognized if and only if its span exactly matches the gold argument; no reward is given for partial matches. Results on the test set from the 60/20/20 random split are shown for all models in Table 2. Cross-validation results are shown in Table 3 for the BERT-based models. The best overall performance comes from the fine-tuned BERT model with parse tree features. The cross-validation results also show that the use of parse tree features improves Precision in all cases, if not Recall. This makes sense, as the parse tree features can be used to reject what would otherwise be False Positives. In addition, for Arg2, the LSTM models with pre-trained BERT perform better than fine-tuned BERT; we have not yet identified a reason for this. Finally, models perform better on Arg2 (both Recall and Precision) than on Arg1. We attribute this to the fact that, even though Arg2 is not marked with an explicit connective (these are all implicit relations), there may still be positional and/or syntactic cues to its identity.
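The exact-match metric can be sketched as follows; spans are (start, end) token offsets, and the example values are illustrative:

```python
def exact_match_prf(gold_spans, pred_spans):
    """Precision/Recall/F1 where a predicted argument counts as correct
    only if its (start, end) token span exactly matches a gold span;
    no credit is given for partial matches."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# One exact match, one near-miss (off by one token): the miss scores zero
p, r, f1 = exact_match_prf({(0, 5), (6, 10)}, {(0, 5), (6, 9)})
```

The all-or-nothing criterion makes the metric strict: a prediction off by a single token contributes nothing.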
Supporting these observations are statistics from the test set. First, of its 987 sentences with intrasentential implicit relations, the leftmost argument aligns with the beginning of the sentence 495 times (50.2%). Of these, Arg1 is the leftmost argument 430 times (86.9%). At the other end of the sentence, 685 (69.4%) relations have their rightmost argument ending at the end of the sentence, almost 200 more than those with an argument at the beginning. Of these 685 relations, 614 (89.6%) have Arg2 at their rightmost boundary. Note that whenever an argument starts or ends at a sentence boundary, the model just needs to predict the other end of the span, which is easier than predicting both ends. With Arg2 appearing more often at a sentence boundary than Arg1, this is consistent with our observation of model performance.
Even though the training data for the argument identification model contains at most one relation per sentence and the model only predicts continuous argument spans, we purposely relax the constraint that the model predict at most one relation per sentence. Across all test set predictions, our best model predicts more than one Arg1 in 77 sentences (7.53% of all sentences predicted to have a relation) and more than one Arg2 in 44 sentences (4.31%). This probably arises from duplicating training instances with more than one relation (cf. Section 3.1). Later in this section, we show how the sense classifier can be used to select among multiple predicted arguments.
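Counting multiple predicted Arg1s or Arg2s amounts to grouping the BIO sequence into spans. A minimal sketch (our own simplification: any I- tag is treated as continuing the open span):

```python
def bio_spans(labels):
    """Group a BIO label sequence into (role, start, end) spans,
    end exclusive."""
    spans, start = [], None
    for i, lab in enumerate(list(labels) + ["O"]):  # sentinel closes last span
        if start is not None and not lab.startswith("I-"):
            spans.append((labels[start][2:], start, i))
            start = None
        if lab.startswith("B-"):
            start = i
    return spans

def count_role(labels, role):
    """Number of distinct spans predicted for a role (e.g. 'Arg1')."""
    return sum(1 for r, _, _ in bio_spans(labels) if r == role)

pred = ["B-Arg1", "I-Arg1", "O", "B-Arg2", "B-Arg1"]
```

Running `count_role` over a model's output flags exactly the sentences where more than one candidate exists for a given argument role.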
As noted earlier, the order of arguments in an intra-sentential implicit relation is not fixed, so we have also correlated model performance with argument order. Of the 5,157 sentences overall with intra-sentential implicit relations, 4,665 (90.5%) have their arguments in Arg1-Arg2 order. In the 60/20/20 split test set, this imbalance is higher: of the 987 intra-sentential implicit relations, 914 (92.6%) show the more common Arg1-Arg2 order. Performance of the models in predicting argument order is shown in Table 4. Note that a match of argument order here does not imply an exact match of arguments in the previous evaluation. All models are more accurate in predicting the more frequent Arg1-Arg2 order than the Arg2-Arg1 order, although the performance difference is smaller with BERT-based models than with LSTM-based models, suggesting that BERT deals better with under-represented data.
For completeness, we also analyze the best model's argument identification performance under different conditions. First, we consider the part of the 60/20/20 split test set whose sentences contain more than one intra-sentential implicit relation. For a fair comparison, we ensure that all single-relation instances derived from the same sentence are in the test set (36 instances). We further divide these into instances whose relation lies further left in the sentence and those whose relation lies further right. The top part of Table 5 shows the performance of the fine-tuned BERT model with parse tree features on these sets. Compared with the original test set, overall performance on relations derived from sentences with multiple relations is much worse. This is expected, as the training set for argument identification assumed that a sentence can have at most one intra-sentential implicit relation. We next examine performance for different sense labels, based on the four most frequent senses in the test set, namely those that appear more than 100 times. The bottom part of Table 5 shows that our model performed best on CONTINGENCY.CAUSE. Inspection of the True Positives shows this to result from Arg2 of these relations often being headed by an AltLex token with part-of-speech tag VBG, which uniquely conveys the sense CONTINGENCY.CAUSE.RESULT. The model can thus easily learn to recognize these arguments using the parse tree features.
Sense Classification Cross-validation results for Precision, Recall, and F_1 for the top four senses recognized by our sense classifier are shown in Table 6. The complete performance breakdown is given in Table 14 in Appendix A.3, along with a confusion matrix (Figure 3) and results on the test set of the 60/20/20 random split (Table 13). As the distribution of senses is imbalanced, we calculate overall performance by averaging performance for each sense, weighted by its frequency in the test set. Note that the overall F_1 for intra-sentential implicits is much higher than the 50.41 reported by Liang et al. (2020), where an LSTM-based model is used. The overall accuracy of our sense classifier is 69.54% under cross-validation and 75.19% on the test set of the 60/20/20 random split. The results show that performance on CONTINGENCY.PURPOSE is very high, with F_1 exceeding the weighted average by almost 20 points. We speculate that this is because over 90% of CONTINGENCY.PURPOSE labels are on relations where Arg2 begins with a free "to" clause, and there are few other intra-sentential implicits labelled CONTINGENCY.PURPOSE.
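The frequency-weighted averaging can be sketched as follows; the sense names are real PDTB labels, but the scores and counts below are illustrative, not the paper's results:

```python
def weighted_f1(per_sense_f1, sense_counts):
    """Overall score: per-sense F1 averaged, weighted by each sense's
    frequency in the test set."""
    total = sum(sense_counts.values())
    return sum(per_sense_f1[s] * sense_counts[s] for s in sense_counts) / total

overall = weighted_f1({"CONTINGENCY.CAUSE": 0.8, "CONTINGENCY.PURPOSE": 0.9},
                      {"CONTINGENCY.CAUSE": 3, "CONTINGENCY.PURPOSE": 1})
```

Weighting by frequency keeps rare senses from dominating the headline number, at the cost of down-weighting performance on the long tail.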
Our sense classifier is trained using gold argument spans taken from the PDTB-3. In reality, such gold annotations will rarely be available. Thus, we also use predicted arguments from our argument identification model as inputs, to test the ability of our sense classifier. Specifically, we first obtain test set sentences with intra-sentential implicit relations and their parse trees. We then feed these to our argument identification model to get predicted arguments, which are in turn fed to our sense classifier together with the parse tree features. Sentences for which the argument identification model fails to predict any arguments are ignored in this evaluation. Table 15 in Appendix A.3 shows the sense classification results using predicted arguments and, for comparison, the results using gold arguments. Note that, as we dropped some sentences, the results with gold arguments differ slightly from those on the original test set. Because the performance drop from gold to predicted arguments is small, we argue that our models can be used as a pipeline for handling intra-sentential implicit relations in shallow discourse parsing, with the input simply being a sentence. We noted earlier that our argument identification model might predict multiple Arg1s and/or Arg2s for a given sentence. We therefore assessed whether the sense classifier could be used to decide which of the predicted arguments to use. Specifically, we identify cases where one argument has a single prediction but the other has multiple predictions. For each Arg1-Arg2 pair, we use the sense classifier to predict the sense label and its likelihood. Comparing these likelihoods, we select the pair with the highest certainty, ignoring cases where the predicted senses are the same for all pairs.
We also implement a baseline that always chooses the pair with the most frequent sense, CONTINGENCY.CAUSE. The 60/20/20 split test set contains 34 cases with multiple predictions of Arg1 for a single Arg2. In 23 of these 34 instances, the correct Arg1 belongs to the pair with the highest likelihood; the baseline gets only 13 cases correct. Similarly, the test set contains 23 cases with multiple predictions of Arg2 for a single Arg1. In 13 of these 23 instances, the pair to which the sense classifier assigns the highest likelihood correctly identifies which Arg2 to use; the baseline gets only 9 cases correct. While further analysis is needed, this shows that the sense classifier can contribute to selecting the right argument from the set of predicted candidates.
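The selection procedure can be sketched as follows. Here `classify` is a stand-in for the sense classifier, returning a (sense, likelihood) pair for an argument pair; the candidate pairs and scores are toy values, not the paper's data:

```python
def select_pair(candidates, classify):
    """Given candidate (arg1, arg2) pairs, pick the one whose predicted
    sense the classifier is most confident about.  Returns None when all
    pairs receive the same predicted sense (the tie case the paper
    ignores in its evaluation)."""
    scored = [(pair, classify(pair)) for pair in candidates]
    if len({sense for _, (sense, _) in scored}) == 1:
        return None
    return max(scored, key=lambda item: item[1][1])[0]

# Toy stand-in classifier: pair -> (sense label, likelihood)
fake = {("a", "b"): ("CONTINGENCY.CAUSE", 0.9),
        ("c", "b"): ("CONTINGENCY.PURPOSE", 0.6)}
choice = select_pair(list(fake), fake.get)
```

The baseline from the text would instead pick whichever pair was classified with the corpus's most frequent sense, regardless of likelihood.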

Conclusions and Future Work
To the best of our knowledge, this is the first work to attempt to identify the arguments of intra-sentential implicit discourse relations in the framework of the PDTB-3, as well as their order and at least one of their sense relations. We used a model architecture similar to those used for sequence tagging tasks, and concluded that BERT-based models outperform LSTM-based models both in exactly matching the gold annotations of arguments and in correctly predicting the order of Arg1 and Arg2. We confirmed that using parse tree features as input to the model assists with these tasks. We also provided evidence that our sense classifier, together with the argument recognizer, can be used as a pipeline for handling intra-sentential implicit relations, and that the sense classifier can aid in selecting the right argument from the set of predicted candidates.
Our methods have several limitations. First, we assumed that every sentence has at most one intra-sentential implicit relation, whereas in reality multiple such relations are possible (cf. Table 1). Secondly, our approach does not support discontinuous argument spans. Thirdly, we have ignored implicit relations "linked" to explicit relations (either intra-sentential or inter-sentential). Finally, although we followed the lead of Shi and Demberg (2017) in assessing performance using cross-validation, because it is more reliable than choosing a specific test set, we did not cast all our results in those terms. In future work, we plan to address these problems and develop methods that can identify all the relations within a sentence.

A.2 Additional Results for Argument Identification
In this section, we provide the results of the sequence tagging objective for all model variants, as well as the results of argument order prediction. All results are reported on the test set. Table 8 shows results for our baseline LSTM model using GloVe word embeddings.

A.3 Additional Results for Sense Classification
In this section, we provide the full breakdown of the performance of our sense classifier on each sense label. The results are given in Table 13. Note that we only include senses that occur in the test set. We provide a confusion matrix in Figure 3. In addition, we provide results using 10-fold cross-validation in Table 14, as well as the full comparison of sense classification results using gold versus predicted arguments in Table 15.