Profiling News Discourse Structure Using Explicit Subtopic Structures Guided Critics

We present an actor-critic framework to induce subtopical structures in a news article for news discourse profiling. The model uses multiple critics that act according to known subtopic structures, while the actor aims to outperform them. The content structures constitute sentences that represent latent subtopic boundaries. Then, we introduce a hierarchical neural network that uses the identified subtopic boundary sentences to model multi-level interaction between sentences, subtopics, and the document. Experimental results and analyses on the NewsDiscourse corpus show that the actor model learns to effectively segment a document into subtopics and improves the performance of the hierarchical model on the news discourse profiling task.


Introduction
News discourse profiling is a discourse processing task that aims to classify sentences in news articles into different content types, where each content type characterizes the specific discourse role of a sentence in describing a news story (Choubey et al., 2020). It is vital to effectively contextualize the occurrence of a news event, which has been shown useful for extracting event structures from a document (Choubey et al., 2020; Choubey and Huang, 2021). Furthermore, this task is likely to benefit a range of other NLP applications that require deep story-level text understanding, such as text summarization and complex question answering.
As the discourse roles are interpreted with respect to the main event, the current approach for discourse profiling uses a hierarchical neural network model (Choubey et al., 2020) that relies on a sentence-level encoder to obtain sentence embeddings, which attend to the local context, and a document embedding to capture the underlying main topic. The hierarchical model, intuitively, provides a mechanism for capturing both global and local dependencies among sentences and the main topic. However, the model is completely unaware of the underlying content organization structures that are used while producing news reports. Besides, squeezing document-level features into a single vector provides limited space to learn an effective document representation and model its interaction with the sentences. (Code and data are available at https://github.com/prafulla77/Discoure_Profiling_RL_EMNLP21Findings.)
To extend the modeling capabilities and incorporate document-level content organization structures into the hierarchical model, we propose to decompose a document into latent subtopics by identifying subtopic boundary sentences, and to model two levels of interactions: between sentences and a subtopic, and between subtopics and the document. We hypothesize that learning subtopic representations allows the model to focus on the locally relevant sentences, independent of the main content, and identify a sentence's fine-grained discourse function within its local subtopical context. Further interactions between subtopics and the document vector help to determine the broader role of a subtopic with respect to the main content. For instance, in the news document in Figure 1, we can identify the discourse role of sentence S7 by combining the two levels of information. First, sentences S6 and S8 describe events that happened years before the main event, which can be modelled through the interaction between the document and the subtopic embedding corresponding to [S6-S8]. Then, the events in S7 have temporal proximity with the events in sentences S6 and S8, which can be modelled through the interaction between the sentence S7 embedding and the subtopic embedding.
Several past works have independently studied subtopic structures, and a document can exhibit multiple subtopic structures depending on the segmentation criteria used. In this paper, we consider two subtopic structures: 1) broad-genre topic segments generated by the TextTiling algorithm; and 2) the news genre-specific inverted pyramid structure identified through a plausible rules-based surrogate (Section 5). Depending on the document, one of these subtopic structures may be more suitable for identifying discourse role labels. Also depending on the document, the most suitable subtopic structure may not be strictly the same as one of the known subtopic structures. For instance, in Figure 1, we show three subtopic structures, namely, TextTiling, the inverted pyramid structure, and a new subtopic structure that is based on temporal frames. Here, the three subtopic structures only partially overlap with each other. Notably, sentences in each of the segments obtained by considering temporal frames exhibit homogeneous discourse role labels, and this could be the most suitable structure to consider.

Figure 1: An example document annotated with three different subtopic structures. The first is based on TextTiling (Hearst, 1997) and is shown with the black solid line ([S1-S8], [S9-S11]). The second structure is based on the locally inverted pyramid structure (discussed in § 5.2) and is shown through red dashed lines ([S1-S5], [S6-S8], [S9-S11]). The third, shown by colored boxes, segments the document based on temporal position: the first segment (S1, S2) focuses on the main event, the second segment (S3, S4, S5) describes events following the main event, the third segment (S6, S7, S8) describes historical events, and the last segment (S9, S10, S11) again covers the current context.
In this work, we limit ourselves to indirectly using known explicit subtopic structures as critics in a new variant of the actor-critic model that selects between the standard REINFORCE (Williams, 1992) algorithm and imitation learning for training the actor. Specifically, when the subtopic structure identified by the actor performs better than all known explicit subtopic structures, we baseline the standard REINFORCE algorithm with the average of the rewards obtained by all explicit subtopic structures. On the other hand, if one of the explicit subtopic structures performs better than the actor, we force the actor to imitate that subtopic structure. Intuitively, this allows the actor model to learn to identify the subtopic boundary sentences most suitable for a given document, performing better than, or at least comparably to, a known explicit subtopic structure on the discourse profiling task.
Experimental results on the NewsDiscourse corpus show that modeling latent subtopic structures in a hierarchical discourse model improves its performance by 2.6 and 1.3 points on average macro and micro F1 scores respectively. The improvement is consistent across different model initializations and shows that modeling the underlying sequential content organization structure enables the system to better predict content types for individual sentences.

Related Work
Theory on News Content Organization The organization of different news elements (Van Dijk, 1985; Myers and Simms, 1989; Bell, 1998; Schokkenbroek, 1999; Ytreberg, 2001; Mani et al., 2005; Saleh, 2014) has been extensively explored through case studies in journalism and discourse. For instance, Pöttker (2003) and Filak (2019) studied the widely used inverted pyramid structure that follows the standard relevance ordering. Specific to the content structures of hard-news reports, White (1997) observed that generic hard news exhibits a non-linear structure where sub-components such as consequences, causes, contextualization, or other supportive information possess an orbital relationship with the main event. Further, these sub-components are organized around the main event with repetitions of the most newsworthy event, i.e., the main content. This aligns with our inverted pyramid structure based on the rule of relevance ordering, where a main sentence following non-main sentences indicates a segment boundary. Besides, feature news also follows well-defined content organization structures, such as introductory anecdotes or backgrounding, which is handled by our rules that separate historical and anecdotal contents from the relevance ordering.
Neural Models for Discourse Modeling Deep neural networks have been successfully explored for modeling discourse (Ji and Eisenstein, 2014a,b; Becker et al., 2017), including hierarchical models (Li et al., 2016b; Liu and Lapata, 2017; Dai and Huang, 2018) that induce hierarchical structure. Morey et al. (2017), however, found that some of the improvements from neural models on RST parsing are attributable to differences in evaluation procedures. Nonetheless, Morey et al. (2017) concluded that neural models are more effective in modeling discourse, though the relative error reduction rates are lower than reported. The better discourse modeling capabilities of neural models are also evident from their widespread adoption in follow-up works such as Zhang et al. (2020) and Koto et al. (2021).
For automatic text segmentation, a multitude of approaches such as lexical overlap, Bayesian learning, or dynamic programming have been proposed (Hearst, 1997; Choi, 2000; Utiyama and Isahara, 2001; Eisenstein and Barzilay, 2008; Du et al., 2013). More recent works rely on neural network models to learn different aspects of text segmentation such as coherence and cohesion (Wang et al., 2017; Sehikh et al., 2017; Bahdanau et al., 2016a; Arnold et al., 2019).

Reinforcement Learning for NLP Applications
Reinforcement learning has been frequently used for sequence generation tasks to mitigate exposure bias or to directly optimize task-specific evaluation metrics such as the BLEU score (Ranzato et al., 2015; Henß et al., 2015; Bahdanau et al., 2016b; Paulus et al., 2017; Fedus et al., 2018). In addition, RL has been explored for a range of NLP tasks such as question answering, dialog generation (Li et al., 2016a), text summarization (Chen and Bansal, 2018), knowledge-graph reasoning (Lin et al., 2018), and relation extraction (Qin et al., 2018). To the best of our knowledge, we are the first to explore RL techniques for exposing underlying content organization structures in news articles, as well as the first to use linguistically motivated critics to reduce the variance of the REINFORCE algorithm.

Task Description
News discourse profiling categorizes sentences in news articles into eight schematic categories that are defined following the news content schemata proposed by Van Dijk (Teun A, 1986; Van Dijk, 1988a,b; Choubey et al., 2020). The eight content types describe the common discourse roles of sentences in telling a news story. Specifically, Main Event (M1) sentences introduce the main event relating to the major subjects of a news article. Consequence (M2) sentences describe consequence events immediately triggered by the main event. Previous Event (C1) sentences describe recent events that act as possible causes or preconditions for the main event, while Current Context (C2) sentences describe the remaining context-informing contents.
News articles may also describe past events that precede the main event by months or years (Historical Event (D1)), or unverifiable situations that are often fictional or personal accounts of incidents of an unknown person (Anecdotal Event (D2)). Lastly, opinionated contents, including reactions from immediate participants, experts, known personalities, as well as journalists or news sources, are covered by the Evaluation (D3) category, except speculations and projected consequences, which are labeled as Expectation (D4).

Given a document X : {x_1, x_2, .., x_n} of n sentences with their content-type labels Y : {y_1, y_2, .., y_n}, our main goal is to learn a model f : X → Y that classifies each sentence x_i in the document X into its content type y_i. In the first step, a latent function f_T : X → T ∈ {1, 2, .., n}^k, a classifier, is used to identify k subtopic boundary sentences in the document. These boundary sentences are used to partition the document into multiple subtopics. In the second step, a classification function f_C : [X, T] → Y combines the output of the latent function f_T with the sentences X in the document to perform the final content-type classification (Figure 2). Overall, the model consists of a sentence encoder, a biLSTM-based (Hochreiter and Schmidhuber, 1997) hierarchical encoder to obtain contextualized sentence representations used by both f_T and f_C, and a pointer decoder network, used exclusively by f_T, to select subtopic boundary sentences.
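This two-step decomposition can be sketched with hypothetical type signatures; `run_pipeline` and the rule-based stand-ins for f_T and f_C below are illustrative only, not the paper's implementation:

```python
from typing import Callable, List

Sentence = str
Label = str

def run_pipeline(
    doc: List[Sentence],
    f_T: Callable[[List[Sentence]], List[int]],
    f_C: Callable[[List[Sentence], List[int]], List[Label]],
) -> List[Label]:
    boundaries = f_T(doc)        # step 1: latent subtopic segmentation
    return f_C(doc, boundaries)  # step 2: content-type classification

# Toy example with rule-based stand-ins for the two learned components:
doc = ["s1", "s2", "s3", "s4"]
labels = run_pipeline(doc, lambda d: [0, 2], lambda d, t: ["M1"] * len(d))
```

In the full model, both functions share the hierarchical sentence encoder described next.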

Hierarchical Sentence Encoder
We use a hierarchical encoder to learn context-aware sentence representations. Given a word sequence x_i represented by {w_i1, w_i2, .., w_im}, we first transform the sequence to contextualized word embeddings E_i using the pre-trained ELMo (Peters et al., 2018). Then, we use a word-level biLSTM layer over E_i to obtain hidden state representations H_i and take their weighted average to obtain the local sentence embedding S^L_i. Weights for the hidden states are obtained using a two-layered feed-forward neural network. Finally, we apply another sentence-level biLSTM over the sequence of headline and sentence embeddings {H^L, S^L_1, S^L_2, .., S^L_n} to obtain the contextualized sentence representations S^C that are later used in both the sub-modules f_T and f_C:

H_i = biLSTM_word(E_i),   α_it = softmax_t(FFN(H_it)),   S^L_i = Σ_t α_it H_it,
[H^C, S^C_1, S^C_2, .., S^C_n] = biLSTM_sent([H^L, S^L_1, S^L_2, .., S^L_n])    (1)
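The attention-weighted pooling that turns word-level biLSTM states into a local sentence embedding can be sketched as follows. This is a minimal NumPy sketch; the layer names and shapes (`w1`, `w2`, `attentive_pool`) are illustrative assumptions, not the paper's code:

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def attentive_pool(hidden_states, w1, w2):
    """Weighted average of word-level hidden states (sketch of S^L).
    w1, w2 play the role of the two-layered feed-forward scorer."""
    scores = np.tanh(hidden_states @ w1) @ w2   # (m, d) -> (m,)
    alpha = softmax(scores)                     # attention weights
    return alpha @ hidden_states                # (d,) pooled embedding

rng = np.random.default_rng(0)
m, d = 5, 8                                     # 5 words, hidden size 8
H = rng.normal(size=(m, d))
s_local = attentive_pool(H, rng.normal(size=(d, d)), rng.normal(size=d))
```

The same pooling pattern is reused later to build subtopic and document vectors from sentence representations.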

Identifying Subtopic Boundary Sentences through Pointer Decoder Network
Given the sentence embeddings from the hierarchical sentence encoder, we use an LSTM decoder-based pointer network (Vinyals et al., 2015) to identify subtopic boundary sentences. We initialize the decoder's hidden state with the document encoding (d^h_{k=1} = D from eq. 3) and start decoding using the embedding corresponding to the first sentence (S^C_1 in eq. 1). At decoding step k, we calculate the subtopic boundary sentence probability following eq. 2, where T_{k−1} is the index of the (k−1)-th boundary sentence.
The pointer-decoder network (eq. 2) along with the contextualized sentence encoder (eq. 1) constitute our f_T model. Note that we do not know a priori the number of subtopic boundary sentences in a document. Therefore, we append a special sentence "eod" at the end of each document; when it is sampled as a subtopic boundary sentence, it indicates the end of subtopic boundary sentence decoding.
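The decoding loop with the special "eod" stop condition can be sketched as follows; `score_fn` is a hypothetical stand-in for the pointer scores of eq. 2, and the greedy selection here corresponds to test-time inference rather than training-time sampling:

```python
def decode_boundaries(score_fn, n_sentences, max_steps=None):
    """Greedy sketch of the pointer decoder: score_fn(prev, step) returns
    one score per position 0..n, where position n is the special 'eod'
    sentence. Decoding stops when 'eod' scores highest."""
    eod = n_sentences
    boundaries, prev = [], 0
    for step in range(max_steps or n_sentences + 1):
        scores = score_fn(prev, step)
        pick = max(range(len(scores)), key=scores.__getitem__)
        if pick == eod:
            break                       # 'eod' selected: stop decoding
        boundaries.append(pick)
        prev = pick
    return boundaries

# Toy scorer: always points at sentence prev+2, then 'eod' past the end.
n = 6
picks = decode_boundaries(
    lambda prev, _: [1.0 if i == min(prev + 2, n) else 0.0
                     for i in range(n + 1)], n)
```

With this toy scorer, decoding selects sentences 2 and 4 and then halts on "eod".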

Discourse Profiling
Given the list of subtopic boundary sentences T^L and the contextualized sentence representations [H^C, S^C_1, .., S^C_n], we use scalar soft-attentions (α_s) over the sentence representations, as described in eq. 3, to learn local subtopic (T) and global document (D) representations:

T = Σ_{i ∈ subtopic} α^s_i S^C_i,   D = Σ_{i=1..n} α^d_i S^C_i    (3)

Finally, we combine the sentence, local subtopic, and document representations through element-wise products and differences (u_i) and use a two-layered feed-forward neural network to predict the labels. The networks defined in eq. 3, together with the contextualized sentence encoding network in eq. 1, make up the discourse profiling network f_C.
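The fusion of sentence, subtopic, and document vectors via element-wise products and differences can be sketched as below; the exact feature set in u_i is an assumption, since the text specifies only the operations, not which pairs are combined:

```python
import numpy as np

def combine(s, t, d):
    """Sketch of u_i: concatenate the sentence embedding s with element-wise
    products and differences against the subtopic (t) and document (d)
    vectors, before the two-layer feed-forward classifier."""
    return np.concatenate([s, t, d, s * t, s - t, s * d, s - d])

dim = 4
s, t, d = np.ones(dim), np.full(dim, 2.0), np.full(dim, 3.0)
u = combine(s, t, d)   # fed to a two-layered feed-forward classifier
```

Product and difference features of this kind are a common way to let a shallow classifier pick up on similarity and contrast between paired representations.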

Learning f_T through Subtopic Structure-guided Critics
Our goal is to train the neural network-based subtopic boundary sentence scorer f_T using indirect supervision derived from the performance of f_C on the discourse profiling task. Intuitively, the REINFORCE algorithm (Williams, 1992), which has shown success in a range of NLP tasks, offers a suitable mechanism to train our f_T model. However, vanilla REINFORCE is known to suffer from the problem of high variance. In addition, vanilla REINFORCE is incapable of inducing any known subtopic structure into f_T. Therefore, we propose a new variation of the actor-critic (Konda and Tsitsiklis, 2000) model which defines multiple critics, each using a known subtopic structure, and trains in either an imitation or a reinforcement learning mode, depending on the performance of the f_C classifier with known subtopic structures versus the subtopic boundary sentences predicted by f_T.
Specifically, we consider f_T as the actor network that samples subtopic boundary sentences (T_S) following eq. 2. Then, we use the sampled subtopic boundary sentences to partition the news document and use eq. 3 to identify content types Ŷ : {ŷ_1, .., ŷ_n} for all the sentences. We calculate the average of the micro and macro F1 scores of the predicted content types Ŷ and use that as the reward R_A for our actor network. Following the same steps with reference subtopic boundary sentences T^j_R derived from a known subtopic structure (the j-th), we also obtain the reward R^j_C for each of our critics. Next, if the actor's reward exceeds all the critics' rewards (R_A > R^j_C ∀j), we use the reinforcement learning formulation and train f_T using the L_RL loss in eq. 4. Alternatively, if the actor's reward is lower than that of any critic, we use imitation learning with the cross-entropy loss (L_IL) based on the critic with the maximum reward. The discourse profiling classifier f_C is trained using the standard cross-entropy loss on the discourse profiling task (L_C).
At every iteration, the RL loss term forces the f_C model to perform at least as well as the model with the known subtopic structure whose reference subtopic boundary sentences T_R yield the highest critic reward. The f_C model thus converges to parameters that obtain a higher reward than its counterpart with reference subtopic boundary sentences.
For the f_T model, if it chooses good subtopic boundary sentences T_S that give a higher reward than T_R, the likelihood of T_S is further increased. However, if it chooses bad T_S, the imitation learning loss forces the model to mimic the subtopic structure with the highest reward.
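The mode-switching logic described above can be sketched as a small helper; the function name and return values are illustrative, not the paper's implementation:

```python
def training_mode(actor_reward, critic_rewards):
    """If the actor beats every critic, train with REINFORCE baselined on
    the critics' average reward; otherwise imitate the best critic."""
    if all(actor_reward > r for r in critic_rewards):
        baseline = sum(critic_rewards) / len(critic_rewards)
        return ("reinforce", actor_reward - baseline)  # advantage term
    best = max(range(len(critic_rewards)), key=critic_rewards.__getitem__)
    return ("imitate", best)                           # index of best critic

mode_a = training_mode(0.62, [0.55, 0.58])  # actor wins -> REINFORCE
mode_b = training_mode(0.50, [0.55, 0.58])  # a critic wins -> imitation
```

Baselining the advantage on the critics' average reward is what ties the variance reduction to the known subtopic structures, rather than to a learned value function.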

Document-level Content Organization Structures
We experiment with two subtopic structures: 1) broad-genre topic segments and 2) the news genre-specific inverted pyramid structure. Note that both topic segments and inverted pyramid structures are automatically identified, through a statistical model or a plausible rules-based surrogate respectively, and may not perfectly represent subtopic boundaries.

TextTiling
TextTiling (Hearst, 1997) is a paragraph-level model of discourse structure based on the notion of subtopic shift. It uses lexical co-occurrence and distribution patterns to divide a document into a sequence of topically coherent segments. It is a widely used algorithm for finding subtopic segments in text and presents an effective representation for document-level content organization structures. We use the implementation provided with nltk (Bird et al., 2009) in our experiments.
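Since our experiments use nltk's implementation, the following is only a minimal pure-Python sketch of TextTiling's core idea (gap similarity plus depth scoring); the real algorithm additionally smooths scores and operates on fixed-size token blocks rather than whole sentences:

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    num = sum(a[w] * b[w] for w in a)
    den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def tiling_boundaries(sentences, k=1, threshold=1.0):
    """Score each gap by the lexical similarity of the k sentences on
    either side, then mark gaps whose 'depth' (drop relative to the
    neighbouring similarity peaks) meets a threshold."""
    bags = [Counter(s.lower().split()) for s in sentences]
    gaps = []
    for g in range(1, len(bags)):
        left = sum((bags[i] for i in range(max(0, g - k), g)), Counter())
        right = sum((bags[i] for i in range(g, min(len(bags), g + k))), Counter())
        gaps.append(cosine(left, right))
    boundaries = []
    for g, score in enumerate(gaps):
        depth = (max(gaps[: g + 1]) - score) + (max(gaps[g:]) - score)
        if depth >= threshold:
            boundaries.append(g + 1)   # a boundary before sentence g+1
    return boundaries

doc = ["the cat sat on the mat", "the cat ate the mouse",
       "the mouse saw the cat", "stocks fell sharply today",
       "stocks closed lower today", "markets pushed stocks lower"]
bounds = tiling_boundaries(doc)   # topic shift before sentence index 3
```

On this toy document, the only deep similarity valley is at the shift from the cat sentences to the stock sentences.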

Locally Inverted Pyramid
We also consider the inverted pyramid structure (Pöttker, 2003), which is most often used in news media (Dai et al., 2018). It organizes the news content in decreasing order of relevance, placing the most relevant information at the top. While the inverted pyramid is a global content organization structure, we make a simplifying assumption that a document consists of smaller sequences of segments that locally follow the inverted pyramid structure. We identify a sentence as representing a subtopic boundary if it breaks the non-increasing relevance order of the preceding sentences, i.e., its relevance is higher than that of its preceding sentences on the relevance scale. Given that the relevance order of sentences is not always aligned with their textual order, this provides an accessible proxy for defining subtopical boundaries. Specifically, the eight discourse content types align with the relevance order of content in a document, with main event sentences (M1) being the most relevant and central to the document, followed by immediate consequences (M2), causes and general context (C1, C2), and then opinions and expectations (D3, D4). This allows us to use the content types of sentences to extract subtopic boundary sentences and partition a document into smaller subtopical segments. For instance, a main event sentence following context-informing or supportive contents becomes a subtopic boundary sentence. With the above rationale, we first identify the first sentence of a document as a subtopic boundary sentence. Then, given a document and content labels (x_i, y_i) ∈ X, we identify new subtopic boundary sentences x_i following the rules defined in Algorithm 1.
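A plausible sketch of these rules is given below (Algorithm 1 itself is not reproduced in the text); the relevance ranks are inferred from the ordering described above, and D1/D2 are skipped per the dissociation discussed in the text:

```python
# Relevance ranks inferred from the described ordering (lower = more
# relevant); D1/D2 are excluded from the relevance ordering.
RELEVANCE = {"M1": 0, "M2": 1, "C1": 2, "C2": 2, "D3": 3, "D4": 3}

def ip_boundaries(labels):
    """Plausible sketch of the locally inverted pyramid rules: the first
    sentence starts a segment, and any later sentence whose relevance
    exceeds that of the previous relevance-ordered sentence starts a
    new one."""
    boundaries, prev_rank = [0], None
    for i, y in enumerate(labels):
        rank = RELEVANCE.get(y)      # None for D1/D2: dissociated
        if rank is None:
            continue
        if prev_rank is not None and rank < prev_rank and i not in boundaries:
            boundaries.append(i)     # relevance rose: new segment starts
        prev_rank = rank
    return boundaries

# e.g. an M1 sentence after context-informing content opens a new segment:
labels = ["M1", "M2", "C2", "D1", "C2", "M1", "D3"]
segs = ip_boundaries(labels)
```

Note that the intervening D1 sentence does not reset the ordering, so only the later M1 sentence opens a new segment.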
Note that we dissociated the historical (D1) and anecdotal (D2) content types from the relevance ordering, as they are frequently used to set the tone of a news article or to highlight the main argument with personal experiences or historical events.

Experimental Setup

We evaluate our content organization structure-aware model on the NewsDiscourse corpus (obtained from https://github.com/prafulla77/Discourse_Profiling). It consists of 802 English news articles taken from three different news sources, NYT, Xinhua, and Reuters, and covers the business, crime, disaster, and politics domains. Each sentence in a news document is annotated with one of the eight discourse content types (described in §3), and additionally with speech and non-speech labels; since speech labels are not related to the discourse profiling structure of a news document, our experiments only focus on classifying a sentence into one of the eight discourse content types. We used 502 documents for training, 100 documents for validation, and 200 documents for testing, following the standard splits provided with the dataset. Models are evaluated on the F1 score for each content type as well as micro F1 and macro P/R/F1 scores using the implementation provided by the scikit-learn (Pedregosa et al., 2011) library.

Baseline Models
Hierarchical: uses the hierarchical neural network architecture as proposed by Choubey et al. (2020) to learn sentence and document encodings and model associations between each sentence and the document encoding.
Self-Critic: uses the output of the f_T network under the test-time inference algorithm (Rennie et al., 2017) to identify subtopic boundary sentences. Specifically, we take the argmax over the probability p(T_k | T_1, .., T_{k−1}; H^C, .., S^C_n) (eq. 2) at the k-th decoding step to identify the k-th subtopic boundary sentence. This model learns to build content structures entirely from the indirect supervision signal obtained from the f_C model, i.e., the average of the micro and macro F1 scores on the discourse profiling task.

Subtopic Structure-aware Models
TextTiling: directly uses the output of the TextTiling model to partition documents and is trained on the loss (L_C) for the discourse profiling task alone. The model does not include the pointer network for identifying subtopic boundaries. Besides that, it uses all other neural components and is structurally similar to the RL-based models.
Joint-IP: learns to jointly identify subtopic boundary sentences, defined with the rules in §5.2 to induce the local inverted pyramid structure, and predict content types. It replaces the pointer network with a two-layer feed-forward neural network to identify subtopic boundary sentences. The model is trained on the average of L_C and a binary cross-entropy loss over subtopic boundary sentences.
RL-TT and RL-IP: use a single critic, defined through the TextTiling and locally inverted pyramid subtopic structures respectively.
RL-IP/TT: uses two critics, the first defined through TextTiling and the second defined through the locally inverted pyramid subtopic structure.

Implementation Details

Both biLSTMs have a hidden dimension of 1024. All two-layered feed-forward networks use 1024 hidden units, including all networks used for calculating scalar attention weights as well as the networks used to predict subtopic boundary sentences and content-type labels. All models use fixed word embeddings and are trained using the Adam (Kingma and Ba, 2014) optimizer with a learning rate of 5e-5 and a dropout rate (Srivastava et al., 2014) of 0.4 on the output activations of both biLSTMs and all neural layers. The models are trained for 15 epochs and we use the epoch yielding the best validation performance. Consistent with the experimental setup used by Choubey et al. (2020), we run each neural model ten times with random seeds and report the average performance. As reinforcement learning, and neural networks in general, are sensitive to random seeds, analyzing average results alleviates the influence of randomness and provides stable empirical results. Learning rate and dropout rate are identified using grid search: we first search learning and dropout rates from [1e-3, 5e-4, 1e-4, 5e-5] and [0.4, 0.5, 0.6] respectively using the hierarchical model, and then keep both rates constant for all models. Each training run takes ∼1200 seconds, without any major increase in training time from the RL component. We use one document per training iteration. All experiments are performed on an NVIDIA GTX 1080 Ti 11GB using PyTorch 1.2.0+cu92 (Paszke et al., 2019) and AllenNLP 0.8.3 (Gardner et al., 2017).

Results and Analysis
Tables 1 and 2 compare all models on the validation and test datasets respectively. First, on the validation dataset, the best performing self-critic, TextTiling, and joint-IP models perform similarly to the hierarchical model. Only the joint-IP model obtains consistent improvements over the hierarchical model, improving macro and micro F1 scores by 0.75 and 0.41 points respectively. However, on the test dataset, based on the average performance of 10 runs, all three models outperform the hierarchical model. The TextTiling and joint-IP models, which directly use explicit subtopic structures, yield 1.1 and 1.4 points of improvement in the average macro F1 score respectively. The self-critic model, which directly learns subtopic boundary sentences through a reward defined by its performance on the discourse profiling task, performs better than the models with explicit subtopic structures, outperforming the hierarchical model by 1.7 and 0.7 points in average macro and micro F1 scores. The higher average performance of the self-critic model, as evident from the results in Table 2, can be partly attributed to the lower accuracy of the subtopic structure identification models. In addition, the self-critic framework allows the model to identify subtopic boundary sentences by directly optimizing numerical performance, thereby identifying the subtopic structures that are optimal with respect to the parameters of the content-type classification model. This is also evident from the improved performance of the three actor-critic models, RL-TT, RL-IP and RL-IP/TT. Actor-critic based learning helps models learn to identify subtopic boundary sentences that are useful for news discourse profiling but not strictly the same as the used subtopic structures. Overall, our best performing RL-IP/TT model yields 2.72 and 0.97 points higher macro F1 and micro F1 scores over the hierarchical model on the validation set, with a comparable margin of improvement on the average F1 scores on the test dataset.
All reported numbers are averaged over ten runs with random seeds, and we report standard deviations for both macro and micro F1 scores. Statistical significance tests show that both the macro and micro F1 scores of the RL-IP/TT model are significantly better than those of the hierarchical, self-critic, TextTiling and joint-IP models, with p < 0.05 on a paired t-test (Dietterich, 1998). Similarly, the macro F1 scores of the RL-TT and RL-IP models are significantly better than those of the hierarchical, TextTiling and joint-IP models with p < 0.05.
The consistent improvement across different models provides evidence that learning a latent content organization structure to further segment documents, and modeling both local subtopic representations and global document representations, leads to better main topic induction and achieves improved content-type classification performance. Between the TextTiling and inverted pyramid subtopic structures, we observe that the latter performs better in both the joint learning and actor-critic learning frameworks. This is expected, since TextTiling is a broad-genre subtopic structure while the inverted pyramid structure is specific to news articles. Further, our rules (§5.2) used to build local inverted pyramid structures directly correlate with the content types of sentences. It is also worth noting that jointly using TextTiling and inverted pyramid structures as critics performs better than using each structure individually.

Distributions of subtopic boundary sentences
In Table 3, we examine the distributional overlap among subtopic boundary sentences identified by the RL-IP/TT model (RM) and the TextTiling (TT) and inverted pyramid (IP) subtopic structures on the validation dataset. We observe that subtopic boundary sentences identified by the RL-IP/TT model exhibit higher overlap with the inverted pyramid (324) than with the TextTiling structure (236). Interestingly, models based on the inverted pyramid structure obtain better performance than the TextTiling-based models (Tables 1 and 2). The higher overlap for the inverted pyramid structure corroborates its greater effectiveness in inducing an appropriate subtopic structure for discourse profiling.

                  RM   IP   TT
Temporal frames   13   18    7

Table 4: Subtopic boundary sentence overlap between the temporal frames-based subtopic structure and the RL-IP/TT model, inverted pyramid, and TextTiling structures on a subset of 10 documents from the validation dataset. There are in total 68 subtopic boundary sentences identified by the RL-IP/TT model, and 79, 52 and 58 subtopic boundary sentences identified by the inverted pyramid, TextTiling, and temporal frames-based structures respectively.
In addition, we manually annotated a subset of 10 documents from the validation dataset with a subtopic structure based on temporal frames (as shown in Figure 1). As shown in Table 4, subtopic boundary sentences identified by the RL-IP/TT model exhibit overlap with the temporal frames-based subtopic structure. This is not implausible given the overlap between the temporal frames-based subtopic structure and the inverted pyramid and TextTiling structures, as also noted from Table 4. In a nutshell, subtopic boundary sentences identified by different subtopic structures exhibit partial overlap, and by using multiple critics to guide the actor network in our actor-critic formulation, we can enable the model to learn a subtopic structure that is not necessarily identical to the used critics but is more effective in profiling the discourse structure of a given document.

Conclusion
We have presented a document-level content-organization-structure aware neural network model for discourse profiling. We explored actor-critic learning based frameworks to induce subtopic structures in a news document. Then, we modeled two levels of interactions, between sentences and the local subtopic representation, and between subtopic representations and the document, which consistently outperformed the previous best hierarchical model. For future work, we intend to experiment with new modeling techniques to incorporate explicit subtopic structures. Further, we plan to extend topical structures to other discourse structures, such as rhetorical relations, and model their interdependencies with the discourse profiling task.