A Neural Transition-based Model for Argumentation Mining

The goal of argumentation mining is to automatically extract argumentation structures from argumentative texts. Most existing methods determine argumentative relations by exhaustively enumerating all possible pairs of argument components, which suffers from low efficiency and class imbalance. Moreover, due to the complex nature of argumentation, there is, so far, no universal method that can address both tree and non-tree structured argumentation. To address these issues, we propose a neural transition-based model for argumentation mining, which incrementally builds an argumentation graph by generating a sequence of actions, avoiding inefficient enumeration operations. Furthermore, our model can handle both tree and non-tree structured argumentation without introducing any structural constraints. Experimental results show that our model achieves the best performance on two public datasets of different structures.


Introduction
Argumentation mining (AM) aims to identify the argumentation structures in text, and has received widespread attention in recent years (Lawrence and Reed, 2019). It has been shown beneficial in a broad range of fields, such as information retrieval (Carstens and Toni, 2015; Stab et al., 2018), automated essay scoring (Wachsmuth et al., 2016; Ke et al., 2018), and legal decision support (Palau and Moens, 2009; Walker et al., 2018). Given a piece of paragraph-level argumentative text, an AM system first detects argument components (ACs), which are segments of text with argumentative meaning, and then extracts the argumentative relations (ARs) between ACs to obtain an argumentation graph, where the nodes and edges represent ACs and ARs, respectively.

Figure 1: An example of argumentation mining from the CDCP dataset (Park and Cardie, 2018). Policy, Fact, and Value represent the types of ACs, and Reason refers to the type of ARs. Note that the CDCP dataset we use is preprocessed by Niculae et al. (2017).

An example of AM is shown in Figure 1, where the text is segmented into five ACs connected by four ARs. In this instance, the types of AC2 and AC3 are Fact (a non-experiential objective proposition) and Value (a proposition containing value judgments), respectively. In addition, there is an AR from AC2 to AC3, i.e., "The check either bounced or it did not." is the reason for "There's not a lot of grey area here.", since the latter is a value judgment based on the fact stated in the former.
Generally, AM involves several subtasks: 1) Argument component segmentation (ACS), which separates argumentative text from non-argumentative text; 2) Argument component type classification (ACTC), which determines the types of ACs (e.g., Policy, Fact, Value); 3) Argumentative relation identification (ARI), which identifies ARs between ACs; and 4) Argumentative relation type classification (ARTC), which determines the types of ARs (e.g., Reason and Evidence). Most previous works assume that subtask 1) ACS has been completed, i.e., that ACs have already been segmented, and focus on the other subtasks (Potash et al., 2017; Kuribayashi et al., 2019; Chakrabarty et al., 2019). In this paper, we make the same assumption and perform ACTC and ARI on this basis.
Among all the subtasks of AM, ARI is the most challenging because it requires understanding complex semantic interactions between ACs. Most previous works exhaustively enumerate all possible pairs of ACs (i.e., all ACs are matched to each other via Cartesian products) to determine the ARs between them (Kuribayashi et al., 2019; Morio et al., 2020). However, these approaches are inefficient and can cause class imbalance, since the majority of AC pairs have no relation. Besides, due to different annotation schemes, argumentation graphs mainly take two kinds of structures: tree (Stab and Gurevych, 2014; Peldszus, 2014) and non-tree (Park and Cardie, 2018). Briefly, in tree structures each AC has at most one outgoing AR, while there is no such restriction in non-tree structures (Figure 1). However, studies on these two kinds of structures are usually conducted separately. To date, there is no universal method that can address both tree and non-tree structured argumentation without any corpus-specific constraints.
Towards these issues, we present a neural transition-based model for AM, which can classify the types of ACs and identify ARs simultaneously. Our model predicts a sequence of actions to incrementally construct a directed argumentation graph, often with O(n) parsing complexity. This allows our model to avoid inefficient enumeration operations and reduce the number of potential AC pairs that need evaluating, thus alleviating the class imbalance problem and achieving speedup. Also, our transition-based model does not introduce any corpus-specific structural constraints, and thus can handle both tree and non-tree structured argumentation, yielding promising generalization ability. Furthermore, we enhance our transition-based model with pre-trained BERT (Devlin et al., 2019), and use LSTM (Hochreiter and Schmidhuber, 1997) to represent the parser state of our model.
Extensive experiments on two public datasets with different structures show that our transition-based model outperforms previous methods and achieves state-of-the-art results. Further analysis reveals that our model has low parsing complexity and a strong structure adaptive ability. To the best of our knowledge, we are the first to investigate transition-based methods for AM.

Related Work
In computational AM, there are mainly two types of approaches to model argumentation structures, that is, tree and non-tree.

Tree Structured AM
Most previous works assume that the argumentation graphs can be viewed as tree or forest structures, which makes the problem computationally easier because many tree-based structural constraints can be applied.
Under the theory of Van Eemeren et al. (2004), Palau and Moens (2009) modeled argumentation in legal text as tree structures and used a hand-crafted context-free grammar to identify these structures. The tree structured Persuasive Essay (PE) dataset, presented by Stab and Gurevych (2014, 2017), has been utilized in a number of AM studies. On this dataset, Persing and Ng (2016) and Stab and Gurevych (2017) leveraged the Integer Linear Programming (ILP) framework to jointly predict ARs and AC types, defining several structural constraints to ensure tree structures. The arg-microtext (MT) dataset, created by Peldszus (2014), is another tree structured dataset. Studies on this dataset usually apply tree-based decoding mechanisms, such as Minimum Spanning Trees (MST) (Peldszus and Stede, 2015) and ILP (Afantenos et al., 2018).
Regarding neural network-based methods, Eger et al. (2017) studied AM as both a dependency parsing and a sequence labeling problem with multiple neural networks. Potash et al. (2017) introduced the sequence-to-sequence Pointer Network (Vinyals et al., 2015) to AM, using the outputs of the encoder and decoder to identify AC types and the presence of ARs, respectively. Kuribayashi et al. (2019) proposed an argumentation structure parsing model based on span representations, which uses ELMo (Peters et al., 2018) to obtain representations for ACs.

Non-tree Structured AM
The studies described in Section 2.1 all rest on the assumption that argumentation forms tree structures. However, this assumption is somewhat idealistic, since argumentation structures in real-life scenarios may not be so well-formed. Hence, some studies have focused on non-tree structured AM, typically using the Consumer Debt Collection Practices (CDCP) dataset (Park and Cardie, 2018). On this dataset, Niculae et al. (2017) presented a structured learning approach based on factor graphs, which can also handle the tree structured PE dataset. However, the factor graph needs to be specifically designed according to the type of argumentation structure. Galassi et al. (2018) adopted residual networks for AM on the CDCP dataset. Recently, Morio et al. (2020) proposed a model devoted to non-tree structured AM, with a task-specific parameterization module to encode ACs and a biaffine attention module to capture ARs.
To the best of our knowledge, until now there is no universal method that can address both tree and non-tree structured argumentation without any corpus-specific design. Thus, in this work, we fill this gap by proposing a neural transition-based model that can identify both tree and non-tree argumentation structures without introducing any prior structural assumptions.

Transition-based Methods
Transition-based methods are commonly used in dependency parsing (Chen and Manning, 2014; Gómez-Rodríguez et al., 2018), and have also been successfully applied to other NLP tasks with promising performance, such as discourse parsing, information extraction (Zhang et al., 2019), word segmentation (Zhang et al., 2016), and mention recognition (Wang et al., 2018).

Task Definition
Following previous works (Potash et al., 2017; Kuribayashi et al., 2019), we assume subtask 1) ACS has been completed, i.e., the spans of ACs are given. We then aim at jointly classifying AC types (ACTC) and determining the presence of ARs (ARI). We do not jointly conduct AR type classification (ARTC), because performing ARTC together with ACTC and ARI hurts the overall performance. More details on this issue will be discussed in Section 6.4.
Formally, we assume a piece of argumentation-related paragraph P = (w_1, w_2, ..., w_m) consisting of m tokens and a set X = (x_1, x_2, ..., x_n) of n AC spans are given. Each AC span x_i is a tuple containing the beginning token index b_i and the ending token index e_i of this AC, i.e., x_i = (b_i, e_i). The goal is to classify the types of ACs and identify the ARs, finally obtaining a directed argumentation graph whose nodes and edges represent ACs and ARs, respectively.
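As a concrete illustration of this input format, the sketch below builds a toy paragraph and AC-span set in plain Python; the tokenization and span indices are invented for illustration and are not taken from any dataset:

```python
# A paragraph is a token sequence; each AC span x_i is an inclusive
# (b_i, e_i) pair of token indices into that sequence.
paragraph = ["The", "check", "either", "bounced", "or", "it", "did", "not", ".",
             "There's", "not", "a", "lot", "of", "grey", "area", "here", "."]

# Two hypothetical AC spans, one per sentence of the toy paragraph.
ac_spans = [(0, 8), (9, 17)]

def ac_tokens(paragraph, span):
    """Return the tokens covered by one AC span (inclusive indices)."""
    b, e = span
    return paragraph[b:e + 1]
```

With this representation, the parser below only ever manipulates span indices; the token text is needed just to build AC representations.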

Our Approach
We present a neural transition-based model for AM, which can jointly learn ACTC and ARI. Our model generates a sequence of actions conditioned on the parser state to incrementally build an argumentation graph. We utilize BERT and LSTMs to represent the parser state, which contains a stack σ storing processed ACs, a buffer β storing unprocessed ACs, a delay set D recording ACs to be removed subsequently, and an action list α recording historical actions. The learning problem is then framed as follows: given the parser state at the current step t, (σ^t, β^t, D^t, α^t), predict an action that determines the parser state of the next step, and simultaneously identify ARs according to the predicted action. Figure 2 shows the architecture of our model. In the following, we first introduce our transition system and then describe the parser state representation.

Table 1: Actions designed in our transition system, with the change of state and the precondition for each. R denotes the set of ARs extracted so far. For simplicity, we omit the superscript t and use the subscript i ∈ {0, 1, ...} to denote the element index in stack and buffer; for example, σ0|σ1|σ denotes the top two items of the stack. An action can be selected only if its precondition is satisfied.

Table 2: Transition sequence for the text in Figure 1. For simplicity, we use indices to denote ACs.

Transition System
Our transition system contains six types of actions. Each action changes the parser state in a different way, as summarized in Table 1:

• SHIFT (SH): when β^t is not empty and σ1 is not in D^t, pop β0 from β^t and push it onto the top of σ^t.
• DELETE-DELAY (DE_d): when β^t is not empty and σ1 is in D^t, remove σ1 from σ^t and from D^t, keeping β^t unchanged.
• DELETE (DE): when β^t is empty, remove σ1 from σ^t, keeping β^t and D^t unchanged.
• RIGHT-ARC (RA): when β^t is empty, remove σ0 from σ^t and assign an AR from σ0 to σ1.
• RIGHT-ARC-DELAY (RA_d): when β^t is not empty, assign an AR from σ0 to σ1, add σ0 to D^t for delayed deletion, then pop β0 from β^t and push it onto σ^t. This strategy helps extract more ARs related to σ0.
• LEFT-ARC (LA): remove σ1 from σ^t and assign an AR from σ1 to σ0.

Table 2 illustrates the gold transition sequence for the text in Figure 1, which contains five ACs and four ARs. In the initial state, all ACs are in the buffer. A series of actions then changes the parser state according to Table 1 while extracting ARs. The procedure stops when the terminal state is reached, i.e., the buffer is empty and the stack contains only one element.
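The state changes above can be sketched as a small simulator. This is our reading of the action definitions (in RIGHT-ARC-DELAY, the arc and the delay-marking refer to the stack top before the shift); it is an illustrative sketch, not the paper's implementation:

```python
def apply_action(action, stack, buffer, delay, rels):
    """Apply one transition to the parser state (stack top is index 0).

    stack/buffer hold AC indices, delay is the set D of ACs marked for
    delayed deletion, and rels collects extracted (source, target) ARs.
    """
    if action == "SH":        # shift: buffer front becomes the new stack top
        stack.insert(0, buffer.pop(0))
    elif action == "DE_d":    # delete-delay: drop s1 and unmark it in D
        delay.discard(stack.pop(1))
    elif action == "DE":      # delete: drop s1
        stack.pop(1)
    elif action == "RA":      # right-arc: AR s0 -> s1, then drop s0
        rels.add((stack[0], stack[1]))
        stack.pop(0)
    elif action == "RA_d":    # right-arc-delay: AR s0 -> s1, mark s0, then shift
        rels.add((stack[0], stack[1]))
        delay.add(stack[0])
        stack.insert(0, buffer.pop(0))
    elif action == "LA":      # left-arc: AR s1 -> s0, then drop s1
        rels.add((stack[1], stack[0]))
        stack.pop(1)
    return stack, buffer, delay, rels
```

For instance, on three ACs the (hypothetical) oracle sequence SH, SH, RA_d, LA, RA ends in the terminal state (empty buffer, one stack element) with three extracted ARs, including two outgoing arcs from the same AC, which a tree-constrained system could not produce.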

State Representation
We employ BERT to obtain the representation of each AC and use LSTMs to encode long-term dependencies within the stack, buffer, and action list.
Representation of ACs. We feed the input paragraph P = (w_1, w_2, ..., w_m) into BERT to get the contextual representation matrix H ∈ R^{m×d_b}, where d_b is the dimension of the last BERT layer. In this way, paragraph P can be represented as H = (h_1, h_2, ..., h_m), where h_i is the contextual representation of the i-th token of P. Then, we use the AC span set X = (x_1, x_2, ..., x_n) to produce a contextual representation of each AC from H by mean pooling over the representations of the words in each AC span. Specifically, for the i-th AC with span x_i = (b_i, e_i), the contextual representation of this AC is obtained by

u_i = (1 / (e_i − b_i + 1)) Σ_{j=b_i}^{e_i} h_j,

where u_i ∈ R^{d_b}. In addition, following previous works (Potash et al., 2017; Kuribayashi et al., 2019), we combine some extra features with u_i to represent ACs, including the bag-of-words (BoW) vector, position, and paragraph type embedding of each AC. We denote these features of the i-th AC as φ_i. The i-th AC is then represented by the concatenation

c_i = [u_i; φ_i].

Hence, the ACs in paragraph P can be represented as C = (c_1, c_2, ..., c_n).
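The mean pooling and feature concatenation can be sketched with plain Python lists standing in for BERT token vectors; the toy dimensions and feature values below are invented for illustration:

```python
def mean_pool(h, span):
    """u_i: average of the token vectors h[b_i..e_i] (inclusive) for one AC."""
    b, e = span
    toks = h[b:e + 1]
    dim = len(toks[0])
    return [sum(v[j] for v in toks) / len(toks) for j in range(dim)]

def ac_representation(h, span, features):
    """c_i = [u_i; phi_i]: pooled vector concatenated with extra features."""
    return mean_pool(h, span) + features
```

In practice h would be BERT's last-layer hidden states and the feature vector would hold the BoW, position, and paragraph-type components described above.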
Representation of Parser State. Our transition-based model utilizes the parser state to predict a sequence of actions. At each step t, we denote the parser state as (σ^t, β^t, D^t, α^t). σ^t and β^t are the stack and buffer, which store the representations of processed and unprocessed ACs, respectively. D^t is the delay set recording ACs that need to be removed from the stack subsequently. α^t is the action list storing the actions generated so far. At the beginning, all ACs are in the buffer, i.e., the initial parser state is ([ ], [c_1, c_2, ..., c_n], ∅, [ ]). A series of predicted actions then iteratively changes the parser state. Specifically, at step t we have σ^t = (σ_0, σ_1, ...) and β^t = (β_0, β_1, ...), where σ_i and β_i denote the representations of ACs in the stack and buffer at the current state. We also have α^t = (..., α_{t−2}, α_{t−1}), where α_i denotes the distributed representation of the i-th action, obtained from a look-up table E_a.

To capture the contextual information in the stack σ^t, we feed it into a bidirectional LSTM:

S^t = BiLSTM(σ^t),

where S^t ∈ R^{|σ^t|×2d_l} is the output of the LSTM from both directions, |σ^t| is the size of the stack, and d_l is the hidden size of the LSTM. Similarly, we obtain the contextual representation of β^t by

B^t = BiLSTM(β^t),

where B^t ∈ R^{|β^t|×2d_l} and |β^t| is the size of the buffer. Besides, to incorporate historical action information into our model, we apply a unidirectional LSTM to the action list:

A^t = LSTM(α^t),

where A^t ∈ R^{|α^t|×d_l} and |α^t| is the size of the action list. Furthermore, since the relative distance between the pair (σ_0, σ_1) is a strong feature for determining their relation, we represent it as an embedding e_d through another look-up table E_d. The parser state representation r^t is then obtained by

r^t = [s_0; s_1; b_0; a_{t−1}; e_d],

where s_0 and s_1 denote the first and second elements of S^t, b_0 is the first element of B^t, and a_{t−1} is the last action representation in A^t.

Action Prediction
To predict the action at step t, we first apply a multi-layer perceptron (MLP) with ReLU activation to squeeze the state representation r^t into a lower-dimensional vector z^t, and then compute the action probability with a softmax output layer:

z^t = MLP(r^t),
p(α^t | z^t) = softmax_{A(S)}(W_α z^t + b_α),

where W_α denotes a learnable parameter matrix, b_α is the bias term, and α^t is the predicted action for step t. A(S) represents the set of valid candidate actions according to the preconditions. For efficient decoding, we greedily take the candidate action with the highest probability. With the predicted action sequence, we can identify ARs according to Table 1. Note that the univocal supervision over actions for an input paragraph is built from the gold labels of ARs.
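Restricting the softmax to the precondition-valid set A(S) and decoding greedily can be sketched as follows; the weight matrix, bias, and six-action inventory are illustrative stand-ins, not trained parameters:

```python
import math

ACTIONS = ["SH", "DE_d", "DE", "RA", "RA_d", "LA"]

def predict_action(z, W, b, valid):
    """Greedy decoding: linear layer + softmax restricted to valid actions.

    z is the squeezed state vector, W/b a toy output layer, and valid the
    set of action names whose preconditions hold in the current state.
    """
    logits = [sum(w * x for w, x in zip(row, z)) + bi
              for row, bi in zip(W, b)]
    # mask out actions whose preconditions are not satisfied
    exp = [math.exp(l) if ACTIONS[i] in valid else 0.0
           for i, l in enumerate(logits)]
    total = sum(exp)
    probs = [e / total for e in exp]
    best = max(range(len(ACTIONS)), key=lambda i: probs[i])
    return ACTIONS[best]
```

Masking before normalization means invalid actions receive exactly zero probability, so the greedy argmax can never violate a precondition.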

Training
We jointly train an AC type classifier over the AC representations: p(y_i | C) = softmax(MLP_c(c_i)), where y_i is the predicted type of the i-th AC. Finally, combining this task with action prediction, the training objective of our model is

L(θ) = −Σ_t log p(α^t | z^t) − Σ_i log p(y_i | C) + λ‖θ‖²,

where λ is the coefficient of the L2-norm regularization term and θ denotes all the parameters of the model.
Datasets

The PE dataset contains 402 essays (1,833 paragraphs), of which 80 essays (369 paragraphs) are held out for testing. There are three types of ACs in this dataset: MajorClaim, Claim, and Premise. Each AC in the PE dataset has at most one outgoing AR, so the argumentation graph of a paragraph is either a directed tree or a forest. We extend each AC by including its argumentative marker, in the same manner as Kuribayashi et al. (2019).

The CDCP dataset consists of 731 paragraphs, 150 of which are reserved for testing. It provides five types of ACs (propositions): Reference, Fact, Testimony, Value, and Policy. Unlike the PE dataset, each AC in the CDCP dataset can have two or more outgoing ARs, thus forming non-tree structures.

Implementation Details
For the PE dataset, we randomly choose 10% of the training set as the validation set, consistent with the work of Kuribayashi et al. (2019). For the CDCP dataset, we randomly choose 15% of the training set for validation. Following Potash et al. (2017), for ACTC we report the F1 score for each AC type and their macro-averaged score. Similarly, for ARI we report F1 scores for the presence and absence of links between ACs and their macro-averaged score. All experiments are run five times with different random seeds, and the scores are averaged.
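The macro-averaged F1 used for both tasks is the unweighted mean of per-class F1 scores; a minimal sketch from per-class counts:

```python
def f1(tp, fp, fn):
    """Standard F1 from true-positive, false-positive, false-negative counts."""
    if tp == 0:
        return 0.0  # precision/recall undefined or zero
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return 2 * p * r / (p + r)

def macro_f1(per_class_counts):
    """Unweighted mean of per-class F1, as used for ACTC (one entry per AC
    type) and ARI (one entry each for link presence and absence)."""
    scores = [f1(*counts) for counts in per_class_counts]
    return sum(scores) / len(scores)
```

Because every class contributes equally, rare classes (e.g., infrequent AC types) weigh as much as frequent ones, which is why macro F1 is the metric of choice under class imbalance.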
We finetune uncased BERT-Base in our model. The AdamW optimizer (Loshchilov and Hutter, 2019) is adopted for parameter optimization, with initial learning rates of 1e-5 for the BERT layer and 1e-3 for the other layers. All LSTMs have 1 layer with a hidden size of 256, and the hidden size of the MLP is 512. The dropout rate (Srivastava et al., 2014) is set to 0.5 and the batch size to 32. All parameters of our model are learnable during training. We train the model for 50 epochs with an early stopping strategy, and choose the parameters with the best performance (average of the macro F1 scores of ACTC and ARI) on the validation set. Our model is implemented in PyTorch (Paszke et al., 2019) and runs on an NVIDIA Tesla V100 GPU.

Baselines
In order to evaluate our proposed BERT-Trans model, we compare it with several baselines. We use the HuggingFace implementation of BERT (https://github.com/huggingface/transformers).

For the PE dataset, the following baselines are compared: Joint-ILP (Stab and Gurevych, 2017) jointly optimizes AC types and ARs via Integer Linear Programming (ILP). St-SVM-full is a structured SVM with a full factor graph, the best-performing variant on the PE dataset in the work of Niculae et al. (2017). Joint-PN (Potash et al., 2017) applies a Pointer Network with an attention mechanism to AM, jointly addressing ACTC and ARI. Span-LSTM (Kuribayashi et al., 2019) employs an LSTM-minus-based span representation with pretrained ELMo embeddings, and is the previous state-of-the-art method on the PE dataset.
For the CDCP dataset, we compare our model with the following baselines: Deep-Res-LG (Galassi et al., 2018) applies a residual network with a link-guided training procedure to perform ACTC and ARI. St-SVM-strict is a structured SVM with a strict factor graph, the best-performing variant on the CDCP dataset in the work of Niculae et al. (2017). TSP-PLBA (Morio et al., 2020) uses task-specific parameterization to encode ACs and biaffine attention to capture ARs with ELMo-based features, and is the previous state-of-the-art method on the CDCP dataset.
Furthermore, to show the effectiveness of our proposed transition system, we implement two additional baselines: Span-LSTM-Trans combines the span representation method of Span-LSTM with our transition system on the PE dataset; for a fair comparison, the features and ELMo embeddings used to represent ACs are consistent with those of Span-LSTM. ELMo-Trans replaces BERT in our proposed model with ELMo on the CDCP dataset for a fair comparison with TSP-PLBA.
6 Results and Analysis

Main Results
The overall performance of our proposed model and the baselines is shown in Table 3 and Table 4. Our model achieves the best performance on both datasets. On the PE dataset, it outperforms the previous state-of-the-art model Span-LSTM by at least 1.1% and 1.4% in macro F1 score on ACTC and ARI, respectively. On the CDCP dataset, compared with TSP-PLBA, our model obtains at least a 3.6% higher macro F1 score on ACTC and about a 3.3% higher relation F1 on ARI.
We also show the results where our BERT-based AC representation is replaced by the ELMo-based method, that is, Span-LSTM-Trans on PE dataset and ELMo-Trans on CDCP dataset. We found that, without employing pre-trained BERT, Span-LSTM-Trans and ELMo-Trans still outperform Span-LSTM and TSP-PLBA over ARI, respectively, which demonstrates the effectiveness of our proposed transition system. It can also be observed that our BERT-based AC representation method can further improve the model performance.
Some of the baselines improve overall performance by imposing structural constraints during prediction or decoding. For example, Joint-PN predicts only one outgoing AR for each AC to partially enforce tree structured argumentation graphs. Similarly, to ensure tree structures, Span-LSTM applies an MST algorithm to the probabilities calculated by the model. However, these two methods can only deal with tree structured argumentation. The factor graph based method of Niculae et al. (2017) can handle both tree and non-tree structured argumentative text (St-SVM-full and St-SVM-strict), but the factor graph needs to be specifically designed for datasets of different structures. In contrast, our proposed model handles datasets of both tree and non-tree structures without introducing any corpus-specific structural constraints, and also outperforms all the structured baselines.

Ablation Study
We conduct ablation experiments on the PE dataset to further investigate the impact of each component of BERT-Trans. The results are shown in Table 5. Applying LSTMs to encode the buffer, stack, and action list contributes about 2.0% macro F1 score on ARI, showing the necessity of capturing non-local dependencies in the parser state. Incorporating the buffer into the parser state improves the macro F1 score of ARI by about 1.8%, since the buffer provides crucial information about the ACs still to be processed. Besides, the macro F1 score of ARI drops heavily without the action list (−1.6%), indicating that historical action information has a significant impact on predicting the next action. Without the distance information between the top two ACs of the stack, the macro F1 score of ARI decreases by 0.7%. The components above mainly affect ARI by modifying the parsing procedure and have little impact on ACTC. The BoW feature, however, significantly influences both tasks: removing it causes 2.5% and 1.9% decreases in the macro F1 scores of ACTC and ARI, respectively.

Parsing Complexity

Most previous models parse argumentation graphs by exhaustively enumerating all possible pairs of ACs, i.e., all ACs are paired via Cartesian products, which leads to O(n^2) parsing complexity. In contrast, our transition-based model incrementally parses an argumentation graph by predicting a sequence of actions, typically with linear complexity: given a paragraph with n ACs, our system can parse it with O(n) actions.

The parsing complexity of our transition system can be determined by the number of actions performed with respect to the number of ACs in a paragraph. Specifically, we measure the length of the action sequence predicted by our model for every paragraph in the test sets of the PE and CDCP datasets and plot it against the number of ACs. As shown in Figure 3, the number of predicted actions is linearly related to the number of ACs in both datasets, confirming that our system constructs an argumentation graph with O(n) complexity. In addition, we compared our model with the current state-of-the-art model on the PE dataset, Span-LSTM, in terms of training time; our model is around two times faster.

Joint Learning Analysis
Following Kuribayashi et al. (2019), we also try adding the task of AR type classification (ARTC) to our model for joint learning on the PE dataset. However, as shown in Table 6, jointly learning ARTC together with ACTC and ARI degrades the overall performance, while learning ARTC separately yields better performance. This observation is consistent with the joint learning results of Span-LSTM in Kuribayashi et al. (2019). The reason may be that the class labels of ARTC are highly unbalanced (around 1:10 in the PE dataset and 1:25 in the CDCP dataset), so the resulting uncertainty seriously affects the overall learning. Thus, we mainly focus on joint learning of ACTC and ARI, and argue that ARTC is better learned individually than jointly with the other subtasks. Notably, our model outperforms Span-LSTM on ACTC and ARI even when jointly learning all three subtasks.

Structure Adaptability
To validate the structure adaptive ability of our model on both tree and non-tree structures, we analyze the structure types of the predicted argumentation graphs on the test sets of the PE and CDCP datasets in Figure 4. For the non-tree structured CDCP dataset, even though non-tree structured paragraphs are rare in the test set (only 16%), our model is still able to identify 29.2% of them. This is an acceptable performance considering the generally poor ARI results on the CDCP dataset, caused by its complex non-tree structures. For the tree structured PE dataset, our model predicts all paragraphs as tree structures, showing a strong structure adaptive ability. In contrast, most previous models, such as Joint-PN and Span-LSTM, can only predict tree structures.

Conclusion
In this paper, we propose a neural transition-based model for argumentation mining, which incrementally constructs an argumentation graph by predicting a sequence of actions. Our model can handle both tree and non-tree structures, often with linear parsing complexity. Experimental results on two public datasets demonstrate the effectiveness of our model. One potential drawback is the greedy decoding used for action prediction; in future work, we plan to optimize the decoding process with methods like beam search to further boost performance.