Reducing Discontinuous to Continuous Parsing with Pointer Network Reordering

Discontinuous constituent parsers have always lagged behind continuous approaches in terms of accuracy and speed, as the presence of constituents with discontinuous yield introduces extra complexity to the task. However, a discontinuous tree can be converted into a continuous variant by reordering tokens. Based on that, we propose to reduce discontinuous parsing to a continuous problem, which can then be directly solved by any off-the-shelf continuous parser. To that end, we develop a Pointer Network capable of accurately generating the continuous token arrangement for a given input sentence and define a bijective function to recover the original order. Experiments on the main benchmarks with two continuous parsers prove that our approach is on par in accuracy with purely discontinuous state-of-the-art algorithms, but considerably faster.


Introduction
Discontinuous phrase-structure trees (with crossing branches, like the one in Figure 1(a)) are crucial for fully representing the wide range of syntactic phenomena present in human languages, such as long-distance extractions, dislocations or cross-serial dependencies, among others.
Although continuous approaches ignore these linguistic phenomena by, for instance, removing them from the original treebank (a common practice in the Penn Treebank (Marcus et al., 1993)), there exist different algorithms that can handle discontinuous parsing. Currently, we can highlight (1) those based on Linear Context-Free Rewriting Systems (LCFRS) (Vijay-Shanker et al., 1987), which allow exact CKY-style parsing of discontinuous structures at a high computational cost (Gebhardt, 2020; Ruprecht and Mörbitz, 2021); (2) a variant of the former that, while still making use of LCFRS formalisms, increases parsing speed by implementing a span-based scoring algorithm (Stern et al., 2017) and not explicitly defining a set of rules (Stanojević and Steedman, 2020; Corro, 2020); (3) transition-based parsers that deal with discontinuities by adding a specific transition in charge of changing token order (Versley, 2014; Maier, 2015; Maier and Lichte, 2016; Stanojević and Alhama, 2017; Coavoux and Crabbé, 2017) or by designing new data structures that allow interactions between already-created non-adjacent subtrees; and, finally, (4) several approaches that reduce discontinuous constituent parsing to a simpler problem, converting it, for instance, into a non-projective dependency parsing task (Fernández-González and Martins, 2015; Fernández-González and Gómez-Rodríguez, 2020a) or into a sequence labelling problem (Vilares and Gómez-Rodríguez, 2020). In (4), we can also include the solutions proposed by Boyd (2007) and Versley (2016), which transform discontinuous treebanks into continuous variants where discontinuous constituents are encoded by creating additional constituent nodes and extending the original non-terminal label set (following a pseudo-projective technique (Nivre and Nilsson, 2005)), to then be processed by continuous parsing models, with discontinuities recovered in a postprocessing step.
It is well known that discontinuities are inherently related to the order of tokens in the sentence, and a discontinuous tree can be transformed into a continuous one by just reordering the words, without including additional structures, an idea that has been exploited in practically all transition-based parsers and other approaches (Vilares and Gómez-Rodríguez, 2020). However, in these models the reordering process is tightly integrated with, and inseparable from, the parsing process.
Likely due to the lack of accurate models to accomplish reordering in isolation, we are not aware of any approach framed in (4) that explicitly reduces discontinuous constituent parsing to a continuous problem while keeping the original set of constituent nodes and solving it with a completely independent continuous parser that does not have to deal with an extended label set. Please note that existing approaches that perform discontinuous-to-continuous conversion, such as those of Boyd (2007) and Versley (2016), not only modify the original discontinuous tree by including artificial constituent nodes and enlarging its label scheme (probably penalizing parsing performance), but are also unable to fully recover the original discontinuous tree due to limitations of the proposed encodings.
In this paper, we study the (fully reversible) discontinuous-to-continuous conversion by token reordering and show how any off-the-shelf continuous parser can be directly applied without any further adaptation or extended label set. To undertake the independent token reordering, we rely on a Pointer Network architecture (Vinyals et al., 2015) that can accurately relocate those tokens causing discontinuities in the sentence to new positions, generating new sentences that can be directly parsed by any continuous parser. We test our approach with two continuous algorithms (Kitaev et al., 2019; Yang and Deng, 2020) on three widely-used discontinuous treebanks, obtaining remarkable accuracies and outperforming current state-of-the-art discontinuous parsers in terms of speed.

Continuous Canonical Arrangement
Let w = w_0, ..., w_{n−1} be an input sentence of n tokens, and t a discontinuous constituent tree for w. We are interested in a permutation (reordering) w′ of w that turns t into a continuous tree t′. While there can be various permutations that achieve this for a given tree, we will call the continuous canonical arrangement (CCA) of w and t the permutation obtained by placing the tokens of w in the order given by an in-order traversal of t.
This permutation defines a bijective function f: {0, ..., n−1} → {0, ..., n−1}, mapping each token at position i in w to its new CCA position j in w′. Then, w′ can be parsed by a continuous parser and, by keeping track of f (i.e., storing original token positions), it is trivial to recover the discontinuous tree by applying its inverse f⁻¹. The challenge lies in accurately predicting the CCA positions for a given sentence w (i.e., learning f) without knowing the parse tree t, a complex task that has a large impact on discontinuous parsing performance, as observed by e.g. Vilares and Gómez-Rodríguez (2020), who recently dealt with reordering to extend their sequence-tagging encoding for discontinuous parsing.
In Figure 1, we depict how a discontinuous tree (a) is converted into a continuous variant (b) by applying function f to map each original position to its corresponding CCA position (c).
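The mapping f and its inverse can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the leaf order [0, 1, 4, 5, 6, 2, 3, 7] and the token strings below are hypothetical, standing in for the positions produced by an in-order traversal of some discontinuous tree.

```python
def cca_function(leaf_order):
    """Given the original token positions in the order produced by an
    in-order traversal of the discontinuous tree t, build f: i -> j,
    mapping each original position i to its CCA position j."""
    return {orig: j for j, orig in enumerate(leaf_order)}

def to_cca(tokens, f):
    """Apply f: place each token of w at its CCA position (w -> w')."""
    out = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        out[f[i]] = tok
    return out

def from_cca(cca_tokens, f):
    """Apply the inverse f^-1: recover the original word order of w."""
    return [cca_tokens[f[i]] for i in range(len(cca_tokens))]
```

Since f is a bijection, applying to_cca and then from_cca is the identity, which is exactly what makes the reduction fully reversible.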

Pointer Networks
To implement function f and accurately obtain the CCA positions for each token, we rely on Pointer Networks (Vinyals et al., 2015). This neural architecture was developed to, given an input sequence, output a sequence of discrete numbers that correspond to positions from the input. Unlike regular sequence-to-sequence models that use the same dictionary of output labels for the whole training dataset, Pointer Networks employ an attention mechanism (Bahdanau et al., 2014) to select positions from the input, so they can handle as many labels as the length of each sentence instead of having a fixed output dictionary size.
For our purpose, the input sequence will be w and the output sequence, the absolute CCA positions (i.e., the positions j in w′). Additionally, we keep track of already-assigned CCA positions and extend the Pointer Network with a uniqueness constraint: once a CCA position is assigned to an input token, it is no longer available for the rest of the sentence. As a consequence, the Pointer Network only needs n−1 steps to relocate the tokens of the original sentence from left to right, assigning the single remaining CCA position to the last token.
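A greedy decoder under this uniqueness constraint can be sketched as follows. The score matrix is assumed to come from the attention mechanism, and masking unavailable positions with negative infinity is an illustrative implementation choice, not necessarily the authors' one:

```python
import numpy as np

def decode_with_uniqueness(scores):
    """Greedy decoding under the uniqueness constraint.
    scores: (n-1, n) array; scores[t, j] is the pointer score for
    assigning CCA position j to the t-th input token."""
    n = scores.shape[1]
    available = set(range(n))
    assignment = []
    for t in range(n - 1):
        masked = np.full(n, -np.inf)
        idx = list(available)
        masked[idx] = scores[t, idx]    # already-assigned positions stay masked
        j = int(np.argmax(masked))
        assignment.append(j)
        available.remove(j)
    assignment.append(available.pop())  # the last token takes the one remaining position
    return assignment
```

Note that only n−1 decoding steps are needed: the last token's position is fully determined by the constraint.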
Although the overall performance of the pointer is high enough, we note that its accuracy specifically on tokens affected by discontinuities is substantially lower. This was expected given the complexity of the task and can be explained by the fact that such tokens are less frequent in the training dataset; in languages such as English, discontinuous sentences are so scarce that they do not provide enough examples to adequately train the pointer. To increase the pointer's performance, we decided to jointly train a labeller in charge of identifying those tokens. More specifically, we consider that a token is involved in a discontinuity if its original position i differs from its CCA position j. This holds regardless of whether the token is part of a discontinuous constituent or not; e.g., in Figure 1 it includes both the tokens in blue (which move left) and those in red (which move right). The idea behind this strategy is to prefer those models that better relocate tokens that change their absolute position in the resulting CCA.
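Under this definition, the labeller's gold targets derive directly from f. A minimal sketch (the mapping below is a hypothetical f, not taken from Figure 1):

```python
def relocation_labels(f, n):
    """Binary gold labels for the auxiliary labeller: a token is
    involved in a discontinuity iff its original position i differs
    from its CCA position f(i)."""
    return [int(f[i] != i) for i in range(n)]
```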
While it can be argued that directly handling absolute CCA positions might underperform approaches that use relative positions instead (as reported by Vilares and Gómez-Rodríguez (2020)), we already explored that strategy and found that the use of relative CCA positions yielded worse accuracy in a Pointer Network framework. This can be mainly explained by the fact that the uniqueness constraint cannot be applied when relative positions are used, so the search space is not reduced as sentence processing advances. Moreover, in regular sequence-to-sequence approaches, the use of relative positions leads to a smaller output dictionary, but this benefit has no impact on Pointer Networks, since the size of the dictionary will always be the sentence length.

Neural Architecture
Following other pointer-network-based models (Ma et al., 2018; Fernández-González and Gómez-Rodríguez, 2019), we design a specific neural architecture for our problem.

Encoder: Each input sentence w is encoded, token by token, by a BiLSTM-CNN architecture (Ma and Hovy, 2016) into a sequence of encoder hidden states h_0, ..., h_{n−1}. To that end, each input token is initially represented as the concatenation of three different vectors obtained from character-level representations, regular pre-trained word embeddings and fixed contextualized word embeddings extracted from the pre-trained language model BERT (Devlin et al., 2019).
Decoder: An LSTM is used to model the decoding process. At each time step t, the decoder is fed the encoder hidden state h_i of the current token w_i to be relocated and generates a decoder hidden state s_t that is used to compute the probability distribution over all available CCA positions from the input (i.e., j ∈ [0, n−1] \ A, with A being the set of already-assigned CCA positions). A biaffine scoring function (Dozat and Manning, 2017), which implements the attention mechanism, is used to compute this probability distribution:

  v_t[j] = g_1(s_t)^T W g_2(h_j) + U^T g_1(s_t) + V^T g_2(h_j)
  a_t = softmax(v_t)

where W, U and V are the weights and g_1(·) and g_2(·) are multilayer perceptrons (MLPs).
The attention vector a_t is then used as a pointer that, at time step t, selects the highest-scoring position j as the new CCA position for the token originally located at position i.
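One step of this biaffine scoring can be sketched in NumPy. This is a simplified sketch: the MLPs g_1 and g_2 are assumed to have already been applied to the decoder and encoder states, the masking of assigned positions is folded into the softmax, and all dimensions are illustrative:

```python
import numpy as np

def biaffine_step(s_t, H, W, U, V, assigned):
    """Score every encoder position against decoder state s_t and return
    attention over the still-available CCA positions.
    s_t: (d,) decoder state; H: (n, d) encoder states;
    W: (d, d); U, V: (d,); assigned: set of already-taken positions j."""
    scores = H @ (W @ s_t)            # bilinear term, one score per position j
    scores = scores + H @ V           # encoder-side term
    scores = scores + float(U @ s_t)  # decoder-side term (shared across all j)
    scores[list(assigned)] = -np.inf  # uniqueness constraint: mask taken positions
    e = np.exp(scores - scores.max())
    return e / e.sum()                # attention vector a_t
```

Note that the decoder-side term U^T g_1(s_t) is constant across positions within a step, so it does not change the argmax; it only matters during training.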
The Pointer Network is trained by minimizing the total log loss (cross entropy) for choosing the correct sequence of CCA positions. Additionally, a binary biaffine classifier (Dozat and Manning, 2017) that identifies relocated tokens is jointly trained by summing the pointer and labeller losses. Since the decoding process requires n−1 steps to assign a CCA position to each token, and at each step the attention vector a_t is computed over the whole input, the proposed neural model can process a sentence in O(n²) time. Figure 2 depicts the neural architecture and the decoding procedure for reordering the sentence in Figure 1(a).
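The joint training objective, summing the pointer and labeller losses, can be sketched as follows (a simplified sketch operating on already-computed probabilities; the function name and argument layout are our own, not the authors'):

```python
import numpy as np

def joint_loss(pointer_probs, gold_positions, reloc_probs, gold_labels):
    """Total log loss: cross-entropy of the pointer over the gold CCA
    positions plus binary cross-entropy of the relocation labeller.
    pointer_probs: per-step distributions over positions;
    reloc_probs: per-token probability of being relocated."""
    pointer_ce = -sum(np.log(pointer_probs[t][j])
                      for t, j in enumerate(gold_positions))
    labeller_ce = -sum(np.log(p) if y == 1 else np.log(1.0 - p)
                       for p, y in zip(reloc_probs, gold_labels))
    return pointer_ce + labeller_ce
```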

Setup
Data: We test our approach on two German discontinuous treebanks, NEGRA (Skut et al., 1997) and TIGER (Brants et al., 2002), and the discontinuous English Penn Treebank (DPTB) (Evang and Kallmeyer, 2011), with standard splits as described in Appendix A.1, discarding PoS tags in all cases. We apply discodop (van Cranenburgh et al., 2016) to transform them into continuous treebanks. This tool follows a depth-first in-order traversal that reorders words to remove crossing branches. For all treebanks, we convert discontinuous trees in export format into continuous variants in discbracket format, using the resulting word permutation as CCAs for training the pointer and keeping track of the original word order for implementing the inverse function f⁻¹. Additionally, the resulting continuous treebanks in discbracket format are also converted by discodop into the commonly-used bracket format for training continuous parsers.
Pointer settings: Word vectors are initialized with a concatenation of pre-trained structured skip-gram embeddings (Ling et al., 2015) and fixed weights extracted from one or several layers of the BASE and LARGE sizes of the pre-trained language model BERT (Devlin et al., 2019). In particular, we follow Fernández-González and Gómez-Rodríguez (2020b) and extract weights from the second-to-last layer for the BASE models and, for the LARGE models, from a combination of layers 17 to 20. We did not try other variations that might work better for our specific task. While regular word embeddings are fine-tuned during training, BERT-based embeddings are kept fixed, following a less resource-consuming strategy. See Appendix A.2 for further details.
Parsers: For parsing the CCAs generated by the pointer, we employ two off-the-shelf continuous constituent parsers that excel in continuous benchmarks: the chart-based parser by Kitaev et al. (2019) and the transition-based model by Yang and Deng (2020). In both cases, we adopt the basic configuration (described in their respective papers) and just vary the encoder initialization with BERT BASE and BERT LARGE (Devlin et al., 2019), as well as XLNet (Yang et al., 2019).
Metrics: Following standard practice, we ignore punctuation and root symbols when evaluating discontinuous parsing and use discodop to report F-score and discontinuous F-score (DF1). For jointly evaluating the pointer and labeller performance, we rely on the Labelled Attachment Score (LAS) and choose the model with the highest score on the development set. For reporting speeds, we use sentences per second (sent/s).

Results
In Table 2, we show how our novel neural architecture (combined with two continuous parsers) achieves competitive accuracies on all datasets, outperforming all existing parsers when the largest pre-trained models are employed. It is also worth remarking that the F-scores on discontinuities produced by our setup (where the pointer plays an important role) are on par with those of purely discontinuous parsers. Regarding efficiency, the proposed Pointer Network provides high speeds even with BERT LARGE: on the test splits, 553.7 sent/s on TIGER, 613.5 sent/s on NEGRA and 694.3 sent/s on DPTB. As a result, the continuous parsers' efficiency is not penalized, and the pointer+parser combinations are faster than all existing approaches that use pre-trained language models (including the fastest parser to date by Vilares and Gómez-Rodríguez (2020), which is also outperformed by a wide margin in terms of accuracy). Finally, as also observed on continuous treebanks, no meaningful differences can be seen between the two continuous parsers' performance.

Conclusions and Future work
We show that, by accurately removing crossing branches from discontinuous trees, continuous parsers can perform discontinuous parsing more efficiently, achieving accuracies on par with more expensive discontinuous approaches. In addition, the proposed Pointer Network can be easily combined with any off-the-shelf continuous parser and, while barely affecting its efficiency, extends its coverage to fully model discontinuous phenomena.
We will investigate alternatives to the in-order reordering (e.g., pre- and post-order traversals, or language-specific rules to generate more continuous-friendly structures). While we think that using a different CCA would have no substantial impact on Pointer Network reordering, it might affect continuous parsing performance (as it may be easier for the parser to process reordered constituent trees with a syntax closer to original continuous structures, and factors like the degree of left vs. right branching may also have an influence).