Randomized Deep Structured Prediction for Discourse-Level Processing

Expressive text encoders such as RNNs and Transformer Networks have been at the center of NLP models in recent work. Most of the effort has focused on sentence-level tasks, capturing the dependencies between words in a single sentence, or pairs of sentences. However, certain tasks, such as argumentation mining, require accounting for longer texts and complicated structural dependencies between them. Deep structured prediction is a general framework to combine the complementary strengths of expressive neural encoders and structured inference for highly structured domains. Nevertheless, when the need arises to go beyond sentences, most work relies on combining the output scores of independently trained classifiers. One of the main reasons for this is that constrained inference comes at a high computational cost. In this paper, we explore the use of randomized inference to alleviate this concern and show that we can efficiently leverage deep structured prediction and expressive neural encoders for a set of tasks involving complicated argumentative structures.


Introduction
Many discourse-level NLP tasks require modeling complex interactions between multiple sentences, paragraphs or even documents. For example, analyzing opinions in online conversations (Hasan and Ng, 2013; Sridhar et al., 2015) requires modeling the dependencies between the opinions in individual posts, the disagreement between posts in long conversational threads and the overall view of users, given all their posts.
Learning in these settings is extremely challenging. It requires highly expressive models that can capture the claims made in each document, either by using a rich, manually crafted feature set, or by using neural architectures to learn an expressive representation (Ji and Eisenstein, 2014; Niculae et al., 2017). In addition, reasoning about the interaction between these decisions is often computationally challenging, as it requires incorporating domain-specific constraints into the search procedure, making exact inference intractable. As a result, most current work relies on highly engineered solutions, which are difficult to adapt. Instead of training structured predictors that model the interaction between decisions during training, they combine locally trained classifiers at test time (Stab and Gurevych, 2017).
Our goal in this paper is to study realistic settings, in which discourse-level problems can be learned efficiently when leveraging deep structured prediction, a framework for combining rich neural representation with an inference-layer, forcing consistency between them (Zheng et al., 2015). These models were applied successfully to simpler NLP tagging tasks (Lample et al., 2016), in which inference is tractable. However, as shown in a recent argumentation mining work (Niculae et al., 2017), their applicability to more complex learning tasks is not guaranteed.
Randomized inference algorithms have been proposed for structured NLP tasks, such as tagging and dependency parsing, in the context of linear models (Zhang et al., 2014, 2015; Ma et al., 2019). This approach offers an efficient alternative to exact inference. Instead of finding the optimal output state, the algorithm makes greedy updates to a randomly initialized (or locally initialized) output assignment. Our main contribution is to explore these ideas in the context of deep structured models composed of expressive text encoders, where theoretical guarantees are weak or nonexistent. Moreover, we do this for discourse-level tasks involving a rich set of domain constraints. To do this, we consider two variations of this approach. In the first, the algorithm samples and traverses only legal states (i.e., states consistent with the constraints imposed by domain knowledge). In the second, these restrictions are ignored during training and only applied at test time. Adapting the sampling procedure to the specific constraints imposed by each domain is difficult, motivating the second approach as a generic alternative.
We focus on two discourse-level tasks, stance prediction in discussion forums, described above, and parsing argumentation structures in essays (Stab and Gurevych, 2017). The latter consists of constructing an argumentation tree that represents the type-of, and relation-between, the arguments made in the essay. Models for both tasks typically employ declarative inference for incorporating domain knowledge. Our experiments are designed to quantify the trade-off between different modeling choices, both in terms of task performance and computational cost. We compare exact ILP models, approximate inference based on the popular AD 3 algorithm (Martins et al., 2015) and the two randomized inference algorithms. Our experiments show that in all cases, deep structured prediction outperforms traditional shallow approaches, structured learning outperforms inference over locally trained models, and generic randomized inference performs competitively to exact inference.

Related Work
Using deep structured prediction for NLP has been studied in previous work, typically on traditional sentence-level tasks such as dependency parsing (Chen and Manning, 2014;Weiss et al., 2015), transition systems (Andor et al., 2016), named entity recognition (Lample et al., 2016) and sequence labeling systems (Ma and Hovy, 2016). In most of these cases, inference is tractable. More recently, some efforts have started to look at incorporating deep structured prediction to discourse tasks such as argument mining (Niculae et al., 2017), event and temporal relation extraction (Han et al., 2019) and discourse representation parsing (Liu et al., 2019). In all of these cases, constrained inference is formulated as an integer linear program and solved either using off-the-shelf optimizers or approximation algorithms like AD 3 (Martins et al., 2015).
Randomized approximation has been introduced as an alternative to exact inference. Zhang et al. (2014) suggest a simple randomized greedy inference algorithm and empirically demonstrate its effectiveness for dependency parsing and other traditional NLP tasks (Zhang et al., 2015). The theoretical results in (Honorio and Jaakkola, 2016), based on the probably approximately correct (PAC) Bayes framework, characterize these findings by providing generalization bounds. More recently, Ma et al. (2019) extended the work of Zhang et al. (2014, 2015) to structured prediction tasks with large structured outputs by leveraging local classifiers to find good starting solutions and improve the accuracy of search. All of these methods were evaluated on linear structured models. In this paper, we focus on two tasks: mining argumentative structures in essays and stance prediction in online debates. Stab and Gurevych (2017) approach argumentative essays using an exhaustive set of hand-crafted features, linear local classifiers and ILP at test time. Niculae et al. (2017) jointly learn to score multiple decisions while enforcing domain constraints. They explore structured SVMs and RNNs, using the AD3 inference algorithm (Martins et al., 2015). On the other hand, there are several works on predicting user stances in online debates. Some approaches model the problem as a text classification task (Somasundaran and Wiebe, 2010; Sun et al., 2018), while others take a collective approach to model user behavior and interactions (Walker et al., 2012; Hasan and Ng, 2013; Sridhar et al., 2015; Li et al., 2018). In the latter case, inference procedures include MaxCut, ILP and probabilistic soft logic (Bach et al., 2017).

Modeling
We look at two challenging structured prediction problems that deal with long texts, where dependencies span different paragraphs, documents and authors. To deal with these setups, we define neural factor graphs G = {Ψ}, where each decision ψ_i ∈ Ψ is associated with a neural architecture ρ_i and a set of parameters θ_i. In this section, we introduce the tasks in detail.

Argument Mining
This task aims to identify argumentative structures in essays. Each argumentative structure forms a tree, and there is a forest per document. Nodes correspond to spans of text in the document and they can be labeled either as claims, major claims or premises. Edges correspond to stances (i.e., support/attack relations between nodes). The spans of texts are given, and we need to label nodes, predict which pairs of nodes are connected by an edge and label the edges. Domain knowledge can be exploited as there are only valid edges between pairs of premises, a premise and a claim, or a claim and a major claim. At the same time, all trees are rooted at major claims. Similarly to previous work, we model second order relationships: grandparent (a → b → c) and co-parent (a → b ← c) (Martins and Almeida, 2014;Niculae et al., 2017). Figure 1 has a visual representation of the structure.
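To make the second-order relationships concrete, the sketch below enumerates grandparent and co-parent factors from a list of candidate directed edges. `second_order_factors` is a hypothetical helper for illustration, not part of our implementation:

```python
from itertools import combinations

def second_order_factors(edges):
    """Enumerate second-order factors from directed edges (source, target):
    grandparent factors a -> b -> c and co-parent factors a -> b <- c."""
    grandparents = [(a, b, c)
                    for (a, b) in edges
                    for (b2, c) in edges
                    if b == b2]
    coparents = [(a, b, c)
                 for (a, b), (c, b2) in combinations(edges, 2)
                 if b == b2]  # a and c share the same child b
    return grandparents, coparents
```

For edges {(1, 2), (2, 3), (4, 3)}, this yields the grandparent factor (1, 2, 3) and the co-parent factor (2, 3, 4).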
In this problem, each forest defines a factor graph Ψ and G is the collection of all documents. We define a set of five neural architectures corresponding to the five types of decisions that we need to make: NN = {ρ_node, ρ_link, ρ_stance, ρ_grandparent, ρ_coparent}, each with its own set of parameters θ = {θ_node, θ_link, θ_stance, θ_grandparent, θ_coparent}. Note that, in principle, we can substitute each (ρ_i, θ_i) with any neural architecture. We include details about the architectures in the experimental section.

Stance Prediction
Given a debate thread on a specific political issue, the task is to predict the stance of each post w.r.t. the issue (e.g., pro-life or pro-choice on abortion) (Walker et al., 2012). Following previous work, we model the problem as a collective classification task and consider all posts in a given thread. To do this, we add the task of predicting stance agreement between consecutive posts. As observed in Figure 1, each thread forms a tree, where users participate and respond to each other's posts. For a thread labeling to be valid, we need to enforce consistency between the node and edge labels.
In this case, each discussion thread defines a factor graph Ψ and G is the collection of threads. We define two neural architectures NN = {ρ_stance, ρ_agreement}, each with its own set of parameters θ = {θ_stance, θ_agreement}. As in the previous setup, each (ρ_i, θ_i) can be substituted with any neural architecture; more details are outlined in the experimental section.

Learning
We learn a joint neural model that uses inference during training to ensure consistency across all decisions. Let Ψ be a factor graph with potentials ψ_i ∈ Ψ over all possible structures Y. Let x_i be the input vector to potential ψ_i. Let θ = {θ_i} be the set of parameter vectors associated with a set of neural networks ρ = {ρ_i}, where ρ_i(x_i, y_i; θ_i) is the score for potential ψ_i resulting from a forward pass.
Here y ∈ Y corresponds to the gold structure and ŷ ∈ Y to the prediction resulting from the MAP inference procedure:

ŷ = argmax_{y' ∈ Y} Σ_{ψ_i ∈ Ψ} ρ_i(x_i, y'_i; θ_i)   subject to C(x_c, y_c)

where C is a set of domain-specific constraints defined over the factor graph Ψ, and x_c, y_c indicate the inputs and variables relevant to the constraints. In this work, we experiment with different algorithms to obtain or approximate the arg max, including the randomized procedures outlined in Section 5.
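As a toy illustration of the constrained arg max (in practice the output spaces are far too large to enumerate, which is why we use ILP, AD3 or randomized inference), with hypothetical local scores and an author-style agreement constraint:

```python
from itertools import product

def constrained_map(scores, labels, constraint):
    """Brute-force constrained MAP: argmax over joint assignments y that
    satisfy the domain constraint C (here a boolean predicate over y).
    scores[i][l] plays the role of the potential score rho_i(x_i, l)."""
    best, best_score = None, float("-inf")
    for y in product(labels, repeat=len(scores)):
        if not constraint(y):  # discard assignments violating C
            continue
        s = sum(scores[i][l] for i, l in enumerate(y))
        if s > best_score:
            best, best_score = y, s
    return best, best_score

# Hypothetical example: two posts by the same author must share a stance.
scores = [{"pro": 1.0, "con": 0.2}, {"pro": 0.3, "con": 0.9}]
y_hat, s = constrained_map(scores, ["pro", "con"], lambda y: y[0] == y[1])
```

Without the constraint the locally best assignment would be ("pro", "con"); the constraint forces the jointly best consistent assignment ("pro", "pro").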
To learn θ, we use the structured hinge loss:

L(x, y, ŷ; θ) = max(0, Δ(y, ŷ) + Σ_{ψ_i ∈ Ψ} ρ_i(x_i, ŷ_i; θ_i) − Σ_{ψ_i ∈ Ψ} ρ_i(x_i, y_i; θ_i))

where Δ(y, ŷ) is the Hamming loss. To introduce the Hamming loss into the objective, we perform loss-augmented inference. The pseudo-code for the structured learning procedure can be found in Algorithm 1. We implemented our models using DRaiL (Pacheco and Goldwasser, 2020), a declarative deep structured prediction framework built on PyTorch, and extended it to support our randomized inference procedures.
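A toy sketch of the structured hinge loss with brute-force loss-augmented inference, over a small label space with hypothetical local scores (again, real inference would not enumerate):

```python
from itertools import product

def hamming(y_gold, y_pred):
    """Hamming loss: number of disagreeing positions."""
    return sum(a != b for a, b in zip(y_gold, y_pred))

def structured_hinge(scores, y_gold, labels):
    """Structured hinge loss with loss-augmented inference:
    y_hat = argmax_y Delta(y_gold, y) + score(y), and the loss is
    max(0, Delta(y_gold, y_hat) + score(y_hat) - score(y_gold))."""
    def score(y):
        return sum(scores[i][l] for i, l in enumerate(y))
    y_hat = max(product(labels, repeat=len(scores)),
                key=lambda y: hamming(y_gold, y) + score(y))
    return max(0.0, hamming(y_gold, y_hat) + score(y_hat) - score(y_gold))
```

Adding Δ inside the arg max biases inference toward high-scoring outputs that are also far from the gold structure, which is what makes the hinge loss push their scores down during training.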

Randomized Inference
In this section, we describe the randomized inference procedure used for each task. We define the relevant domain constraints for each case, and explain how we sample solutions that respect them. Finally, we include a discussion of the theoretical bounds for the linear case.

Algorithm 1 Deep Structured Prediction
 1: p ← 0
 2: loss_best ← ∞
 3: θ ← θ_local
 4: θ_ret ← θ
 5: while p < patience do
 6:    p ← p + 1
       … (steps 7–23, the per-epoch structured training loop, elided)
24: return θ_ret

Argument Mining
For randomized inference on argument mining, we adapt the randomized greedy algorithm proposed by Zhang et al. (2014). Algorithm 2 outlines the overall procedure. We consider that each paragraph p ∈ P of an essay contains a single tree. We obtain a locally optimal tree ŷ using the hill-climbing algorithm described below. After that, ŷ is labeled and added to the forest Y. We iterate over each paragraph (line 4) and subsequently score the forest as:

S(ŷ) = Σ_{ψ_i ∈ Ψ} ρ_i(x_i, ŷ_i; θ_i) + w + h·‖y − ŷ‖₁

where the first term is the sum of the scores of the potentials for the predicted tree ŷ, and the weighted Hamming-distance term additionally penalizes the score the more the tree structure differs from the gold structure. The term w + h·‖y − ŷ‖₁ gets close to w if the distance is low, and close to zero if it is high.

More specifically, let ‖y − ŷ‖₁ be normalized to [0, 1], e.g., by dividing the number of node and edge differences by the total number of nodes and edges. In its simplest form, h can be set to −w, and thus the extra term in S(ŷ) becomes w if y = ŷ, or 0 if they differ in every node and edge. Whenever the score of the locally improved forest is better than the best forest found so far, Y becomes the new best-scoring forest Ŷ. Since hill climbing might get stuck in a local optimum, we repeat lines 3–9 for a constant number of restarts.
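The scoring function with the weighted Hamming term can be sketched as follows, using the simplest form h = −w; the function name and flat label lists are illustrative:

```python
def forest_score(potential_scores, y_gold, y_pred, w=1.0):
    """Score a predicted forest: the sum of potential scores plus the
    weighted Hamming term w + h * ||y - y_hat||_1 with h = -w (the
    simplest form). The distance is normalized to [0, 1], so the extra
    term equals w when the prediction matches the gold structure and 0
    when they differ in every node and edge."""
    d = sum(a != b for a, b in zip(y_gold, y_pred)) / len(y_gold)
    h = -w
    return sum(potential_scores) + w + h * d
```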
Algorithm 3 Hill Climbing
 1: ŷ_0 ← initialize tree randomly for paragraph p
 2: label(ŷ_0)
 3: ŷ ← ŷ_0
 4: t ← 0
 5: repeat
       … (generate a candidate ŷ_{t+1} via a local update on ŷ_t; steps elided)
       if S(ŷ_{t+1}) > S(ŷ) then
12:        ŷ ← ŷ_{t+1}
13:    t ← t + 1
14: until no improvement in this iteration
15: return ŷ

Algorithm 3 describes the hill-climbing procedure. It initially draws a tree ŷ_0 uniformly at random. The greedy algorithm then applies local updates to ŷ_t, attempting to reach a better-scoring tree ŷ_{t+1}. This is done by iterating through a top-down, level-ordered node list L of ŷ_t. Let i denote the current position in the list; the entire subtree rooted at L_i is reconnected to the node L_j, for j = i − 1, i − 2, …, 0. If the score of ŷ_{t+1} is higher than the score of ŷ_t, the newly generated tree is kept. The algorithm continues until the score can no longer be improved, and therefore yields a locally optimal tree. Figure 2 depicts how such local updates are performed, with L = (T_1, T_2, T_3, T_4, T_5): a greedy local update of a tree ŷ_t (left) to ŷ_{t+1} and ŷ_{t+2} without score improvement.
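A simplified sketch of randomized hill climbing with restarts. For brevity, the local update here reattaches a single node to an earlier node in the ordering (rather than moving a full subtree), and the scoring function is passed in as a black box; all names are illustrative:

```python
import random

def hill_climb(nodes, score, restarts=5, seed=0):
    """Randomized hill climbing with restarts, in the spirit of Algorithm 3.
    nodes: node ids in top-down order; score: black-box function mapping a
    parent assignment {child: parent} to a float. A local update reattaches
    a node to an earlier node in the ordering, which prevents cycles."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(restarts):
        # random initialization: every non-root node picks an earlier parent
        y = {nodes[i]: nodes[rng.randrange(i)] for i in range(1, len(nodes))}
        improved = True
        while improved:
            improved = False
            for i in range(1, len(nodes)):
                for j in range(i):  # candidate parents preceding node i
                    cand = dict(y)
                    cand[nodes[i]] = nodes[j]
                    if score(cand) > score(y):
                        y, improved = cand, True
        if score(y) > best_score:
            best, best_score = y, score(y)
    return best
```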
It might be the case that a paragraph contains more than a single tree. Therefore, when a tree is initially drawn at random, we introduce an additional phantom node which serves as the new root. This modification no longer restricts hill climbing to single trees: it allows multiple roots, and we treat the second layer of the tree like the top layer.
Domain Specific Constraints: For node labeling, we exploit domain knowledge. Major claims can only occur in the first or last paragraph, and there has to be at least one major claim in each essay. A root is labeled as a major claim with some fixed probability depending on the paragraph (first or last), subject to the condition that at least one major claim exists. Any other root in each paragraph is labeled as a claim. Nodes having an edge to a major claim are labeled as claims as well. All remaining nodes are premises. An edge can be labeled either support or attack, and we draw all edge labels randomly, with a probability of 0.9 of being a support label. The node and edge labels are determined after each iteration, since scoring depends on both links and labels.
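The constrained sampling of labels can be sketched as below. This hypothetical helper covers only root and edge labeling; in the full procedure, nodes linked to a major claim would subsequently be labeled as claims and the rest as premises:

```python
import random

def sample_labels(roots, in_first_or_last, edges, rng,
                  p_major=0.5, p_support=0.9):
    """Sample labels respecting the domain constraints: only roots in the
    first or last paragraph may become major claims, at least one major
    claim must exist, and edge labels are drawn with probability p_support
    of being 'support'."""
    node_labels = {}
    eligible = [r for r in roots if in_first_or_last[r]]
    for r in roots:
        if in_first_or_last[r] and rng.random() < p_major:
            node_labels[r] = "major_claim"
        else:
            node_labels[r] = "claim"
    # enforce the at-least-one-major-claim constraint
    if eligible and "major_claim" not in node_labels.values():
        node_labels[rng.choice(eligible)] = "major_claim"
    edge_labels = {e: ("support" if rng.random() < p_support else "attack")
                   for e in edges}
    return node_labels, edge_labels
```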
In Section 6, we evaluate our models using randomized inference with and without domain specific constraints. In the latter case, all labels are chosen at random.

Stance Prediction
A debate thread provides a fixed structure: nodes and links are predefined, so the tree structure itself requires no improvement. However, nodes and edges still need to be labeled, and these labels can be improved. Initially, we pick the node labels, which can either be pro or con. Following the observations made by Ma et al. (2019), we leverage local classifiers and greedily choose the label with the highest score for each node.
Domain Specific Constraints: To respect the dependencies between node and edge labels, we use the following heuristic: if two consecutive nodes u and v have different stances, the edge (u, v) receives a disagreement label; if they share the same stance, (u, v) gets an agreement label. When author constraints are considered as well, we additionally force the stances of posts written by the same author to be equal.
We attempt to improve node labels by flipping them randomly and subsequently adjusting the edge labels. This is done until an iteration no longer improves the overall score. We restart the algorithm a constant number of times in order to increase the chance of reaching a global optimum. In the experiments, we evaluate our models using randomized inference with and without domain-specific constraints. When constraints are not used, a random node is flipped and its adjacent edge adjusted, without enforcing consistency in the whole tree.
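A minimal sketch of this procedure, with greedy initialization from hypothetical local classifier scores and edge labels derived from the consistency heuristic; names and score dictionaries are illustrative:

```python
import random

def stance_search(nodes, edges, node_scores, edge_scores,
                  restarts=10, seed=0):
    """Randomized search for post stances. Nodes are initialized greedily
    from local classifier scores (following Ma et al., 2019); we then flip
    node labels, re-derive consistent edge labels (same stance -> agree,
    different -> disagree), and keep flips that improve the global score."""
    rng = random.Random(seed)

    def edge_labels(y):
        return {(u, v): ("agree" if y[u] == y[v] else "disagree")
                for (u, v) in edges}

    def total(y):
        e = edge_labels(y)
        return (sum(node_scores[n][y[n]] for n in nodes)
                + sum(edge_scores[uv][e[uv]] for uv in edges))

    # greedy initialization from the local scores
    best = {n: max(node_scores[n], key=node_scores[n].get) for n in nodes}
    for _ in range(restarts):
        y = dict(best)
        improved = True
        while improved:
            improved = False
            for n in rng.sample(nodes, len(nodes)):
                cand = dict(y)
                cand[n] = "con" if y[n] == "pro" else "pro"
                if total(cand) > total(y):
                    y, improved = cand, True
        if total(y) > total(best):
            best = y
    return best
```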
The error of the constrained randomized algorithms can be bounded in the linear case. Define the norm of the set of parameter vectors θ as ‖θ‖ = Σ_{θ_i ∈ θ} ‖θ_i‖₂, where ‖θ_i‖₂ is the Euclidean norm of the parameter vector θ_i. Let n be the number of training samples. From Theorem 2 and Claim ii in (Honorio and Jaakkola, 2016), for ρ_i(x_i, y_i; θ_i) linear in θ_i, the generalization bound (i.e., the difference between the test error and training error) is on the order of ‖θ‖²/n + ‖θ‖/√n + max(1/log 2, ‖θ‖²) · log^{3/2} n / √n. This bound decreases in n and increases in ‖θ‖, which suggests using a large training set and penalizing the norm ‖θ‖ during learning. In our experiments, we show that in practice we can obtain competitive results by implementing the randomized algorithms for the non-linear case.
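For reference, the norm and the resulting bound (restated from Theorem 2 and Claim ii of Honorio and Jaakkola, 2016) can be written in display form as:

```latex
\|\theta\| = \sum_{\theta_i \in \theta} \|\theta_i\|_2,
\qquad
\text{generalization gap} \;\in\; \mathcal{O}\!\left(
  \frac{\|\theta\|^2}{n} \;+\; \frac{\|\theta\|}{\sqrt{n}}
  \;+\; \max\!\left(\frac{1}{\log 2},\, \|\theta\|^2\right)
        \frac{\log^{3/2} n}{\sqrt{n}} \right)
```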

Experiments
We learn our models using four different inference procedures: (1) ILP defines the inference problem as an integer linear program and uses the Gurobi solver to perform exact inference, (2) AD3/ILP translates the ILP program into an AD3 instance to perform approximate inference, (3) Rand-C uses the randomized method with domain constraints, and (4) Rand uses the randomized method without domain constraints. Note that we always use exact inference to evaluate on both the development and test sets. For completeness, we add an entry AD3 where we use AD3 for both training and testing. When using ILP or AD3, the domain constraints are expressed declaratively.

Argument Annotated Persuasive Essays
Dataset: We used the UKP dataset (Stab and Gurevych, 2017), consisting of 402 documents, with a total of 6,100 propositions and 3,800 links (17% of pairs). We use the train/dev/test splits used by Niculae et al. (2017), and report macro F1 for components and positive F1 for relations.
Learning and Representation: We did 5 repetitions and reported the average performance. Each repetition used a different seed to initialize the model parameters. For training, we used stochastic gradient descent, a patience of 10, weight decay of 1e-5, and 5 restarts for randomized inference. For local models, we used a learning rate of 0.05 and for structured learning we used a learning rate of 1e-4. Similarly to previous work on deep structured prediction (Han et al., 2019), we obtained better results by performing structured learning over locally trained models, instead of training them from scratch. To represent the component and the essay, we used a BiLSTM over the words, initialized with GloVe embeddings (Pennington et al., 2014), concatenated with a feature vector following Niculae et al. (2017), without the word features. For representing the relation, we use the components, as well as the relation features used in Niculae et al. (2017). For shallow models, we use a bag-of-words representation for the text and concatenate it with the rest of the features into a single feature vector. Both the feature extraction and the neural implementations are available in the repository.
We test two versions of the model: (1) Base includes node labeling, link prediction and link labeling, and (2) Full adds grandparent and co-parent factors. We can analyze the results across three dimensions.

Structured Learning: The advantage of leveraging more structural dependencies can be seen in Table 1. The model gets increasingly better as more dependencies are considered, and using global learning outperforms learning local models and using inference just at prediction time (L+I).
Deep vs. Shallow: There is a consistent trend showing that deep structured models are more expressive than their shallow counterparts, as we can see by comparing average results in Table 1. To obtain good results using linear classifiers, Stab and Gurevych (2017) relied on an exhaustive set of features (Table 2). These numbers cannot be replicated by using just word features and the feature set suggested by Niculae et al. (2017), as our shallow models and their structured SVM results show. In contrast, deep models and word embeddings are able to leverage this information without additional features. In addition, we find that deep models have a shorter overall training time (3.3x faster for the full model). This can be attributed to the compact embedding representation used in deep models, in contrast to the large sparse one-hot vectors used in linear models. Similarly to previous work (Niculae et al., 2017), we find that higher-order factors and strict constraints are more helpful when using deep structured models than in their shallow counterparts.

Table 2 (selected baselines): Struct RNN full (Niculae et al., 2017): 76.9 / 50.4 / 63.6; Struct SVM strict (Niculae et al., 2017): 77.3 / 56.9 / 67.1; Struct SVM full (Niculae et al., 2017): 77.6 / 60.1 / 68.9; Joint PointerNet (Potash et al., 2017).

Randomized vs. ILP/AD3: When using deep structured prediction, we did not find a statistically significant difference in the performance of the models trained with ILP/AD3 vs. the ones trained with constrained and non-constrained randomized inference. We obtain competitive results with respect to previous work that relies on the same underlying embeddings or features, as observed in Table 2. Recently, Kuribayashi et al.
(2019) were able to further improve performance by exploiting contextualized embeddings that look at the whole document, instead of embedding the arguments in isolation, and by making a distinction between argumentative markers and argumentative components. We attempted document-level contextualized embeddings using BERT and were not able to replicate their success. Moreover, we found no significant improvement in the structured prediction models when replacing our BiLSTM encoders with either BERT or document-level BERT. We leave the exploration of an effective way to leverage contextualized embeddings for future work. As for stance prediction, Stab and Gurevych (2017) identify stances over the resulting structure and obtain a macro F1 of 68.0. Our full models obtain comparable results: 69.2 and 68.4 for ILP and randomized inference, respectively.

Debate Stance Prediction
Dataset: We use a subset of the 4FORUMS dataset from the Internet Argument Corpus (Walker et al., 2012), which consists of a total of 418 discussion threads on four political issues, containing 24,658 posts. We use the same splits as Li et al. (2018). Most previous work reports accuracy. However, given that labels are highly imbalanced, we also report macro F1.
Learning and Representation: We model the problem as a collective classification task by predicting disagreement between consecutive posts in a given thread. We represent posts using a BERT encoder. For disagreement, we simply represent pairs of posts without additional information. We do 5-fold cross-validation and report the average performance. For training, we used AdamW, weight decay of 1e-5, a patience of 3, and 50 restarts for randomized inference. For local models, we used a learning rate of 5e-5 and for structured models we used a learning rate of 2e-6. For structured learning, we initialize the parameters using the local models. Note that we keep fine-tuning BERT during training.

We test two versions of the model: (1) Base includes consistency between node and edge labels, and (2) AC adds author constraints enforcing the same stance for all posts by the same author.
Structured Learning: We can also see that the performance of all structured models outperforms learning local models and using inference just at prediction time (L+I), both for post stance (Table 3) and for disagreement (Table 4).
Randomized vs. ILP/AD3: In the case of stance prediction, there is a significant trend in the performance of the different inference methods. Learning with exact inference generally outperforms the randomized constrained procedure, and the latter outperforms its non-constrained version.

Table 5 compares our models to previous work on this dataset (note that Li et al. (2018) use author profile information in their models, whereas we only look at text). Sridhar et al. (2015) use probabilistic soft logic (PSL) to learn a global assignment for the post labels. They use local classifiers to obtain the input scores to PSL. The main difference between their approach and ours is that we are able to backpropagate the global error into the classifiers, and we find that it improves performance considerably. Even though we use BERT encoders in our structured procedure, we can see that BERT alone is not able to solve the task. Lastly, we compare to the structured representation learning method of Li et al. (2018) and find that we are able to improve on abortion and gay marriage only. Note that these two are the issues with the most data available (8,000 and 7,000 posts, respectively). The main difference between their approach and ours is that they push author profile information into the learned representation. We hypothesize that this is key to obtaining good performance on gun control, which contains only 4,000 posts.

Runtime: In our experiments, randomized inference always outperforms ILP and AD3 in terms of runtime. Figure 3 shows the speedup factor per epoch against ILP and AD3. In argument mining, AD3 is faster than ILP, except on our full model, where both perform similarly. We noticed that ILP consumes a lot of time in initialization and encoding. The randomized inference approach is able to predict argumentative structures 9.1x faster than ILP for our base model, and 7.5x faster than AD3 for our full model.
For stance prediction on 4FORUMS, ILP is considerably faster than AD3; we presume that this is because Gurobi is highly optimized commercial software and our graphs are small. Randomized inference is 11x faster than ILP on the base model and beats AD3 by a factor of 27 when author constraints are used. We also measured pure inference time over five training runs and took the average. Figure 4 shows (in logarithmic scale) plain inference runtime in seconds on the training set for all of our models. We can observe that randomized inference without domain constraints has almost the same runtime as the constrained version. Again, we find that randomized inference considerably outperforms ILP and AD3.

Additionally, we evaluated our model at test time by replacing exact inference with randomized inference, incrementally increasing the number of restarts. Figure 5 shows the performance and runtime of the Rand-C algorithm with respect to exact inference (Rand-C vs. ILP), i.e., the impact of the number of restarts on performance and runtime. Figure 5a shows that the global optimum is closely approached after just 20 restarts for the argument mining task, as opposed to stance prediction on 4FORUMS, where a higher number of restarts is required. This is in line with our reported results in Sections 6.1 and 6.2. Figure 5b shows that randomized inference is about twice as fast as ILP when using 50 restarts for the argument mining task, and it starts to approach the time needed for ILP after 100 restarts. On the other hand, the randomized algorithm on 4FORUMS continues to be an order of magnitude faster even when doing 100 repetitions. Note that as the number of restarts keeps increasing, the randomized procedure will eventually surpass the time needed to perform exact inference.

Summary
We studied the effectiveness of randomized inference for deep structured prediction and obtained positive results for two challenging discourse-level tasks. We showed that, in practice, we can train complex structured models, using expressive neural architectures, and get competitive results at a lower computational cost. Moreover, we saw that combining expressive representations and inference is a promising direction for modeling discourse-level structures. Future directions include expanding the discussion to other tasks involving more complex structures, as well as exploring shared representations across different sub-tasks.