CLaC at SemEval-2020 Task 5: Multi-task Stacked Bi-LSTMs

We consider detection of the span of antecedents and consequents in argumentative prose a structural, grammatical task. Our system comprises a set of stacked Bi-LSTMs trained on two complementary linguistic annotations. We explore the effectiveness of grammatical features (POS and clause type) through ablation. The reported experiments suggest that a multi-task learning approach using this external, grammatical knowledge is useful for detecting the extent of antecedents and consequents and performs nearly as well without the use of word embeddings.


Introduction
Conditional statements are of interest to investigations into the semantics of natural language text because they inform on the factivity of statements as well as establish reasoning and argumentation in text. In formal logic, conditionals obey the principle of explosion (ex falso quodlibet), and counterfactuals allow any conclusion. In language, however, assuming that facts had been different is frequently used to explore, for instance, causal relations relevant to the ongoing argument. Both cases differ significantly from basic assertions, and flagging and classifying these very specific passages of text will enhance other semantic annotations.
The basic structure of a conditional is thus found in many utterances, often as a way to specify presuppositions or assumptions for a given statement. Conditionals consist of two parts, the antecedent and the consequent, as illustrated in Example 1, where the antecedent (the if part) is underlined and the consequent spans the remainder of the sentence. Counterfactuals are conditionals where the antecedent does not hold true.
Example 1 If there were peace, I wouldn't spend another second here.
While only few NLP systems attempt to model inference, a significant number is concerned with attributing degrees of factuality to different statements. Before conditional statements can be mined for their contribution to factivity judgments, they have to be detected. SemEval 2020 Subtask 5.2 (Yang et al., 2020) is concerned with identifying the span of antecedent and consequent clauses in text. The data samples consist largely of single sentences, but may involve several sentences. We stipulate that this is mainly a structural task and experiment with the grammatical notion of clause boundaries. Encoding various clause types on top of POS tags and Glove word embeddings (WEs) (Pennington et al., 2014), we find that a clause type layer improves the performance of a baseline of only Glove WEs, but barely improves on a two layer architecture encoding WEs and POS only. However, a two layer architecture encoding only POS and clause type shows competitive performance at a drastically reduced parameter space. Our system represented the median in the officially scored systems and demonstrates that simple grammatical notions can be stable contributors to this task.

We model the task as token-level sequence labelling: for each input sample S_i = w_{i,1}, w_{i,2}, ..., w_{i,n}, the system predicts target labels Ŷ_i = ŷ_{i,1}, ŷ_{i,2}, ..., ŷ_{i,n}, where ŷ_{i,k} ∈ {A, C, O} for each token k in sample i. The task data, however, is presented as a sequence of characters, and the gold annotations use intervals of character offsets: Ant_i = ch^{ant}_{i,1}, ch^{ant}_{i,2}, ..., ch^{ant}_{i,l} is the interval of antecedent characters labelled A in S_i, and Con_i = ch^{con}_{i,1}, ch^{con}_{i,2}, ..., ch^{con}_{i,m} is the character interval for the consequent, labelled C.
Preprocessing We define a strict input mapping f between character labels and token labels. The corresponding output mapping trivially maps the label of a token to all its constituent characters.
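For illustration, such a strict mapping can be sketched as follows. This sketch is ours, not the system's actual implementation; the function and variable names are illustrative, and we assume that "strict" means a token receives label A or C only if it lies entirely inside the corresponding gold character interval.

```python
def token_labels(tokens, ant_span, con_span):
    """Map character-offset gold annotations to token labels A/C/O.

    tokens:   list of (start, end) character offsets per token, end exclusive
    ant_span: (start, end) character interval of the antecedent
    con_span: (start, end) character interval of the consequent

    A token is labelled A or C only if all of its characters fall inside
    the corresponding interval (a strict mapping); otherwise it gets O.
    """
    def inside(tok, span):
        return span[0] <= tok[0] and tok[1] <= span[1]

    labels = []
    for tok in tokens:
        if inside(tok, ant_span):
            labels.append("A")
        elif inside(tok, con_span):
            labels.append("C")
        else:
            labels.append("O")
    return labels
```

The inverse output mapping simply copies each token's predicted label to every character offset the token spans.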

Grammatical features
The task describes a strict grammatical pattern. Kuncoro et al. (2018) observe that LSTMs are not strong on grammatical relations and show that providing grammatical information can significantly improve LSTM performance. In this spirit, we extract grammatical features using a GATE pipeline with the ANNIE English Tokeniser (Cunningham et al., 2002), the OpenNLP POS tagger (Apache Software Foundation, 2014), and the Stanford Parser (Klein and Manning, 2003) to extract POS tags and the constituent tags S, SBAR, and SINV.

Example 2 CLaC system token annotations: T=input token, P=POS, C=clause, L=output label

T: If there were peace I would n't spend another second here .

POS tag Content words do not greatly impact this task, thus the reduction of input tokens to their POS tags should illuminate the structural patterns. The Stanford Parser uses the Penn Treebank tagset with 45 tags (36 main tags and 9 tags for punctuation, parentheses, etc.). The Penn Treebank tag IN covers both prepositions like on and subordinating conjunctions like that, masking an important clue for the potential start of a consequent. We thus introduce an additional POS tag SC for subordinating conjunctions, bringing the number of POS tags used to 46.
We experiment with the following variants:

POS Penn Treebank tagset (45 tags)
POS1 additionally assigns that to the new POS tag SC
POS2 additionally assigns that and then to SC

In ablation studies on a single layer architecture (that is, making the POS sequence the only input stream), POS2 performs better than POS1 and POS (see Table 1).
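The recoding itself amounts to a simple tag substitution. The following sketch is illustrative rather than the system's actual code; in particular, we assume the recoding is triggered by the surface token alone.

```python
# Trigger tokens recoded to the new tag SC, per variant.
SC_TRIGGERS = {
    "POS1": {"that"},
    "POS2": {"that", "then"},
}

def recode(tokens, tags, variant="POS2"):
    """Replace the Penn Treebank tag of trigger tokens with SC."""
    triggers = SC_TRIGGERS[variant]
    return ["SC" if tok.lower() in triggers else tag
            for tok, tag in zip(tokens, tags)]
```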
Clause tag Antecedent and consequent are clauses. To assist detection of the correct clause boundaries, we train a layer for relevant clause boundaries as determined by the Stanford parser.
The Penn Treebank tagset has 5 tags for clause constituents: S for simple declarative clauses, SBAR for complement clauses possibly introduced by a subordinating conjunction, SBARQ for direct questions introduced by a wh-word or a wh-phrase, SINV for inverted declarative sentences, and SQ for inverted yes/no questions or for the main clause of a wh-question, following the wh-phrase inside SBARQ (Bies et al., 1995).
In the general case, the antecedent is a subordinate clause, while the consequent is the main clause. Subordinate clauses are labelled SBAR; for the SBAR annotations in the input sequence, we select the lowest SBAR label on the path from a token to the sentence root. The same holds for the SINV label. Main clauses, however, are parents of subordinate clauses, not sisters, and require different processing.
We experiment with several variants, distinguished by the number and type of clauses included (φ specifies the number of tags encoded):

CL1 encodes S, SBAR (φ = 2)
CL1-1 encodes S, SBAR, SINV (φ = 3)
CL1-2 encodes S, SBAR and additionally recodes SINV as SBAR (φ = 2)
CL2 encodes S, S_m, SBAR (φ = 3)
CL2-1 encodes S, S_m, SBAR, SINV (φ = 4)
CL2-2 encodes S, S_m, SBAR and additionally recodes SINV as SBAR (φ = 3)

Clause level constituent tags are extracted from the parse tree. Let path(w_{i,k}) be the ordered multiset of constituent labels on the path from token w_{i,k} to the sentence root.
Clause level constituent tag encoding CL1 Let w_{i,k} be an input token. Then

    CL1(w_{i,k}) = SBAR  if SBAR ∈ path(w_{i,k}, root)
                   S     if S ∈ path(w_{i,k}, root) and SBAR ∉ path(w_{i,k}, root)
                   O     otherwise

CL1 is modelled on the simplest conditional statements. A refined version distinguishes a wider variety of patterns, namely between S for root clauses, S_m for embedded main clauses, and SBAR and SINV for subordinate clauses.

Clause level constituent tag encoding CL2-1

    CL2-1(w_{i,k}) = SBAR  if SBAR ∈ path(w_{i,k}, root)
                     SINV  if SINV ∈ path(w_{i,k}, root)
                     S     if S ∈ path(w_{i,k}, root) and there is exactly one S ∈ path(w_{i,k}, root)
                     S_m   if S ∈ path(w_{i,k}, root) and there is more than one S ∈ path(w_{i,k}, root)
                     O     otherwise
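For illustration, the CL2-family encoding can be sketched as a function of the constituent labels on a token's path to the root. This sketch is ours, not the system's code; in particular, the precedence order among clause tags is our assumption, as is the use of O for tokens outside any encoded clause.

```python
from collections import Counter

def cl2_1(path):
    """CL2-1 clause tag for one token.

    path: list (multiset) of constituent labels on the path from the
    token to the sentence root. S vs S_m is decided by how many S
    labels occur on the path: exactly one marks the root clause,
    more than one marks an embedded main clause.
    """
    counts = Counter(path)
    if counts["SBAR"] > 0:
        return "SBAR"   # subordinate clause
    if counts["SINV"] > 0:
        return "SINV"   # inverted declarative
    if counts["S"] == 1:
        return "S"      # exactly one S: root clause
    if counts["S"] > 1:
        return "S_m"    # more than one S: embedded main clause
    return "O"          # no encoded clause label on the path
```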

Architecture
Word embeddings We initialize the embedding layer of the respective models with Glove pre-trained word vectors, fine-tuned during training. For input sample S_i = w_{i,1}, ..., w_{i,n}, let X_i = x_{i,1}, x_{i,2}, ..., x_{i,n} denote the sequence of word embeddings x_{i,j} ∈ R^d, d = 300, where x_{i,j} is the embedding for token w_{i,j}.
Multi-task stacked Bi-LSTMs Our submitted system uses 3 layers of Bi-LSTMs stacked on top of one another, all with input dimensionality d_input = 300 and hidden dimensionality d_h = 150. The first Bi-LSTM layer receives the embedding sequence X_i; the output stream of layer 1 feeds into layer 2, and the output stream of layer 2 feeds into layer 3. Inspired by Søgaard and Goldberg (2016), the output at each layer is supervised for a different sequence labelling task by making predictions at each time step (see also Example 2):

layer 1 POS supervision, W_1 ∈ R^{300×ψ}, where ψ is the number of POS tags
layer 2 clause supervision, W_2 ∈ R^{300×φ}, where φ is the number of clause tags
layer 3 main task supervision, W_3 ∈ R^{300×3}

The predicted label ŷ^l_{i,k} for time step k at layer l is determined by a simple linear classifier, parameterized by W_l:

    ŷ^l_{i,k} = Softmax(W_l^T x^l_{i,k}),    k = 1, ..., n;  l ∈ {1, 2, 3}

where x^l_{i,k} is the representation of token w_{i,k} at layer l and W_l is the classifier weight matrix at layer l.
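A minimal PyTorch sketch of this architecture is given below. It is an illustration under stated assumptions, not the submitted system: ψ = 46 POS tags and φ = 4 clause tags correspond to the POS2 and CL2-1 variants, the class `StackedBiLSTM` and its argument names are ours, and the softmax is left to the loss function.

```python
import torch
import torch.nn as nn

class StackedBiLSTM(nn.Module):
    """Three stacked Bi-LSTMs, each supervised for its own tagging task."""

    def __init__(self, vocab_size, d=300, d_h=150, n_pos=46, n_clause=4):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d)
        # Each Bi-LSTM outputs 2 * d_h = 300 features per token.
        self.lstm1 = nn.LSTM(d, d_h, bidirectional=True, batch_first=True)
        self.lstm2 = nn.LSTM(2 * d_h, d_h, bidirectional=True, batch_first=True)
        self.lstm3 = nn.LSTM(2 * d_h, d_h, bidirectional=True, batch_first=True)
        # Per-layer linear classifiers: POS, clause type, main A/C/O task.
        self.clf = nn.ModuleList([
            nn.Linear(2 * d_h, n_pos),
            nn.Linear(2 * d_h, n_clause),
            nn.Linear(2 * d_h, 3),
        ])

    def forward(self, token_ids):
        h1, _ = self.lstm1(self.emb(token_ids))
        h2, _ = self.lstm2(h1)
        h3, _ = self.lstm3(h2)
        # One score sequence per task, shape (batch, seq_len, n_tags).
        return [clf(h) for clf, h in zip(self.clf, (h1, h2, h3))]
```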
Training paradigm At each forward pass, a layer is randomly selected based on a uniform distribution and the loss is calculated for the task corresponding to the selected layer. When performing backpropagation, the parameters of the selected layer as well as the parameters of all lower layers are updated. Our model is implemented using the PyTorch library (Paszke et al., 2017). The losses at all layers are computed using Cross-Entropy loss and the network is optimized by the Adam optimizer (Kingma and Ba, 2014). The learning rate is lr = 5 × 10 −4 and the network is trained for 7 to 10 epochs.
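One way to implement this schedule is sketched below. It assumes a model that returns a list of per-layer token score sequences and a grouping of parameters by layer; both the function and the freezing logic (clearing the gradients of all higher layers before the optimizer step) are our reading of "only the selected layer and all lower layers are updated", not the system's actual code.

```python
import random
import torch
import torch.nn as nn

def train_step(model, optimizer, token_ids, targets, layer_params):
    """One forward/backward pass of the random-layer training schedule.

    targets:      list of gold label tensors, one per layer/task,
                  each of shape (batch, seq_len)
    layer_params: list of parameter lists; layer_params[l] holds the
                  parameters belonging to layer l (an assumed grouping)
    """
    l = random.randrange(len(targets))   # uniform layer selection
    scores = model(token_ids)[l]         # (batch, seq_len, n_tags)
    loss = nn.functional.cross_entropy(
        scores.flatten(0, 1), targets[l].flatten())
    optimizer.zero_grad()
    loss.backward()
    # Clear gradients of all layers above the selected one, so the
    # optimizer updates only the selected layer and those below it.
    for higher in layer_params[l + 1:]:
        for p in higher:
            p.grad = None
    optimizer.step()
    return loss.item()
```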
Post-processing There can only be one antecedent and only one consequent per data sample. Thus, if our system output shows disjoint regions for either antecedent or consequent, a post-processing step smooths over the gap.
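Such gap smoothing can be sketched as relabelling every token that falls between the first and last occurrence of a span label. The helper below is illustrative, not the system's actual code; in particular, the order in which A and C gaps are filled is our assumption.

```python
def smooth(labels):
    """Make the A and C regions contiguous by filling internal gaps."""
    out = list(labels)
    for tag in ("A", "C"):
        positions = [i for i, lab in enumerate(out) if lab == tag]
        if positions:
            # Relabel everything between the first and last occurrence.
            for i in range(positions[0], positions[-1] + 1):
                out[i] = tag
    return out
```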
Example 3 Post processing: T=token, L=system prediction, F= final, post-processed label

Results
We divide the original training set into training (3251 samples) and validation (300 samples) sets.
We present here only a small subset of our extensive ablation experiments on these variants. Note that the analysis presented here refers to our experiments on the validation data only.

Single layer baselines Our first observation is that, when used in a single layer architecture, POS1 trails our best three layer systems by less than .05 in F1 measure and less than 10 exact matches. POS2 outperforms Glove WE as a single layer baseline in F1, if not in exact matches (see Table 1).

Two layer architectures The clause encodings do not perform as well in single layer architectures, but in combination with WE they demonstrate an increase in exact matches, as shown in Table 2. Interestingly, two layer architectures using only POS and clause features rival combinations using WEs but reduce the parameter space to 46 × 4.

Three layer architectures The three layer architectures shown in Table 3 outperform two layer architectures in F1 as well as in exact matches in most cases, indicating that the grammatical information encoded is not sufficient on its own and that WEs stabilize and improve performance.
WE+POS1+CL1-1 is our submitted model (highlighted in Table 3). Other versions have identical F1 but superior exact matches. We see effects of overfitting on our validation set, since the ranking of our methods on the validation set does not always correspond to the ranking on the actual test set. For instance, the overall best performer on the validation set (WE+POS2+CL1-2) does not perform equally well on the test set. Noteworthy is the performance of a two layer architecture with no word embeddings on the test set: POS2+CL2-2 (Table 2) delivers a very strong performance with a much reduced feature space. We interpret the consistency in results for precision and recall as a measure of the robustness of the system. The three layer architecture with clause level encoding was not strictly necessary for our performance, as WE+POS1 performed better on the test set in exact matches. However, the less balanced precision (0.864) and recall (0.763) for WE+POS1 on the test set suggests more volatile behaviour. Conversely, this suggests that clause type encoding has a stabilizing effect, and Table 3 shows that including clause features has the potential for better performance.

Conclusion
Our goal was to test the potential of grammatical information to improve exact matches in antecedent and consequent span detection. Ablation studies show that grammatical features by themselves form a solid baseline in a two layer Bi-LSTM architecture and dramatically reduce the parameter space for the task.
Our experiments demonstrate that multi-task stacked Bi-LSTM models can effectively super-encode the grammatical features POS and clause type, improving performance for both F1 and exact match scores. For this task, the difference in outcome barely justifies the increase in complexity for three layers. However, the combination results in stable systems that operate at the precision-recall break-even point and that (slightly) outperform single and two layer models. They thus form a promising basis for semantically more complex tasks.