BERT-Proof Syntactic Structures: Investigating Errors in Discontinuous Constituency Parsing

The combined use of neural scoring systems and BERT fine-tuning has led to very high results in many natural language processing (NLP) tasks. These high results raise two important questions about the contribution and the limitations of pretrained language models: (i) what are the remaining errors in the best-performing systems? (ii) on what types of test examples do pretrained language models help the most? In this paper, we investigate both questions for the task of English discontinuous constituency parsing on the Penn Treebank, for which recent models obtain close to 95 F1 score. To do so, we propose two methods for automatically analysing the errors of discontinuous parsers. First, we annotate and release a test suite focused on the syntactic phenomena responsible for discontinuities in the Penn Treebank, enabling us to obtain a per-phenomenon evaluation of a parser's output. Second, we extend the Berkeley Parser Analyser, a tool that classifies parsing errors according to predefined structural patterns, to discontinuous trees. We apply both methods to characterize the errors of a state-of-the-art transition-based discontinuous parser, and to provide an overview of the contribution of BERT to this task.


Introduction
Discontinuous constituency trees are phrase-based syntactic representations in which the constraint that a single phrase must yield a continuous sequence of tokens is lifted. Such representations are well suited to modelling the long-range dependencies that typically arise with syntactic phenomena such as extractions or scrambling. For example, Figure 1 presents a discontinuous VP modelling the relationship between the verb want and its extracted complement How many.
In constituency treebanks, these long-range dependencies are sometimes represented with typed empty categories (traces), coindexed with a displaced phrase (Marcus et al., 1993). However, projective parsers usually ignore them: the norm in Penn Treebank constituency parsing is to preprocess empty categories out of the corpus, leaving out important linguistic information. Formally, discontinuous constituency trees can be interpreted as derivations from mildly context-sensitive grammar formalisms, such as linear context-free rewriting systems (Vijay-Shanker et al., 1987, LCFRS) or multiple context-free grammars (Seki et al., 1991, MCFG). As a result, exact parsing of discontinuous structures has a high computational complexity. For example, CKY-style parsing of an LCFRS is O(n^{3f}) in time (Kallmeyer, 2010), where f is the fan-out of the grammar: the maximum number of spans in a grammar rule.

The current state of the art for English discontinuous constituency parsing on the Discontinuous Penn Treebank is a 94.8 F1 score (Corro, 2020), obtained by a span-based chart parser that combines a neural scoring system with pretrained contextualized embeddings (Devlin et al., 2019, BERT). However, such a high score can be misleading. Despite ensuring comparability across parsers, the exclusive use of classical evaluation metrics (F-score, precision, recall) is hard to interpret and does not disclose information about the syntactic capabilities of a parser. In discontinuous parsing, the standard evaluator discodop (van Cranenburgh et al., 2016) additionally provides metrics that focus only on discontinuous constituents (discontinuous F-score, precision and recall). However, these scores still aggregate information across many distinct syntactic phenomena.
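The notion of fan-out can be made concrete by representing a constituent's yield as a set of token positions: the number of maximal contiguous blocks in that set is the constituent's fan-out, and a fan-out greater than 1 signals a discontinuity. A minimal sketch (the helper names are ours, not part of any parser discussed here):

```python
def spans(positions):
    """Group a constituent's token positions (a set of indices) into
    maximal contiguous blocks; the number of blocks is the constituent's
    fan-out (1 for a continuous constituent)."""
    blocks = []
    for i in sorted(positions):
        if blocks and i == blocks[-1][1] + 1:
            blocks[-1] = (blocks[-1][0], i)   # extend the current block
        else:
            blocks.append((i, i))             # start a new block
    return blocks

def is_discontinuous(positions):
    return len(spans(positions)) > 1

# In "How many do you want ?", a VP covering How(0), many(1), want(4)
# yields two blocks, hence a fan-out of 2.
```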
In this paper, we propose to automatically analyse the errors of discontinuous English parsers in order to provide a fine-grained overview of their current limitations. To do so, we pursue two complementary approaches. First, we construct a test suite focused on 6 syntactic phenomena responsible for the discontinuities in the Discontinuous Penn Treebank (Evang, 2011). Second, we adopt an error-correction-based approach: we search for a sequence of error-correcting tree modifications that leads from the predicted tree to the gold tree, and classify these modifications based on structural patterns. This is a direct extension of the Berkeley Parser Analyser (Kummerfeld et al., 2012) to English discontinuous parsing.
A secondary motivation for this work is to characterize the contribution of BERT to discontinuous constituency parsing. An active line of research consists in assessing the syntactic knowledge learned by language models (Linzen et al., 2016; Marvin and Linzen, 2018; Gulordava et al., 2018), including those with structural supervision (Kuncoro et al., 2018; Wilcox et al., 2019; Hu et al., 2020). These studies usually construct test items: minimal pairs of sentences, such that one is grammatical and the other is not (thus isolating a single grammatical constraint). Then, they observe whether the language model assigns a higher probability to the grammatical alternative. In these papers, the observation of the syntactic ability of the models is indirect. We argue that fine-grained evaluation methods will help compare the syntactic capabilities of parsers with and without access to BERT, providing a complementary view to this line of research. Therefore, we apply both proposed error analysis methods to a state-of-the-art transition-based discontinuous parser in several settings: without pretraining, with fast-text embeddings (Mikolov et al., 2018a; Grave et al., 2018), and with BERT pretraining.
In summary, we make the following contributions:
• We construct a test suite for automating a fine-grained evaluation of English discontinuous parsers on target phenomena.
• We extend the Berkeley Parser Analyser to deal with English discontinuous constituency trees.
• We use these two evaluation methods to characterize the errors of a neural discontinuous parser, trained in several pretraining settings.
We provide the test suite and the error analyser as supplementary material.

Related Work
To address the limitations of using exclusively an F-score to evaluate constituency parsers, prior work has focused on alternative, finer-grained evaluation methods. We review some of them, from both the projective and the discontinuous constituency parsing literature.
Manual error analysis For discontinuous constituency parsing, Evang (2011) performed a manual error analysis by extracting discontinuous trees from the evaluation corpus, classifying them according to the phenomenon at the origin of the discontinuities, and manually checking whether a PLCFRS chart parser recognized them. The same strategy was later used to evaluate a neural transition-based discontinuous parser. However, manual error analysis is quite time-consuming and must be performed anew for each new parser output. Thus, it is difficult to integrate into an evaluation pipeline or to deploy for many parsers.
Automatic error analysis Kummerfeld et al. (2012) introduced a method that consists in searching for a sequence of atomic tree modifications (such as inserting a node, removing a node, or moving a node) that leads from a predicted constituency tree to the gold tree. Then, they classify the tree modifications according to predefined structural patterns, e.g., 'PP-attachment', 'NP-attachment', 'labelling error'. Their method made it possible to identify the most frequent patterns of error, and to characterize the improvement obtained with techniques such as reranking. However, the structural patterns used to classify mistakes depend both on the language of the treebank and on its annotation strategies. Therefore, error patterns need to be redesigned when adapting the error analyser to another treebank (Kummerfeld et al., 2013). Moreover, their method and software do not handle discontinuous constituents, hence our proposal.
Targeted evaluation Another line of work on fine-grained parser evaluation has focused on specific structures or phenomena. Ratnaparkhi et al. (1994) introduced a collection of English sentences with PP-attachment ambiguities, both to improve evaluation on this type of structure and to foster research on its resolution. Kübler et al. (2009) introduced a test suite for German that encompasses a wider range of syntactic structures (such as coordination of unlike constituents or extraposed relative clauses). However, both resources focus on projective constituency representations.
For discontinuous structures, Maier et al. (2014) released discosuite, a testsuite for German. They annotated a set of sentences from the Tiger corpus (Brants et al., 2004), with the syntactic phenomena responsible for the tree discontinuities. They released their annotations, such that researchers can run their parsers on the sentences and compute a per-phenomenon evaluation of the parser. To the best of our knowledge, such a test suite only exists for German. In this article, we introduce one for English, along with an evaluation script that provides per-phenomenon statistics.
We focus our analysis on English, since the resources we introduce are for this language. However, we also provide results on German using discosuite.

Test Suite Annotation
This section describes our methodology for annotating a set of discontinuous constituency trees with the syntactic phenomena responsible for the discontinuities. We first extract all discontinuous trees from the validation section of the discontinuous version of the Penn Treebank (Evang, 2011; Evang and Kallmeyer, 2011), except those whose discontinuities are only due to punctuation attachment. We obtain 266 trees, which corresponds to 16% of the corpus. Then, we manually assign one or several categories from the set previously proposed for manual error analysis by Evang (2011):
• Wh-extractions come in several subtypes in the data: relative clauses, verbal adjunct clauses, complement clauses, and indirect and direct questions. They all include a wh word among how, when, which, that, where, what, why, whenever.
• Circumpositioned and fronted quotation phrases only include quotations, and systematically feature a speech verb, usually says or said.
• It-extrapositions feature an expletive it in the interpretation location of an extraposed clausal argument.
• Subject-verb inversions, in which the subject follows the verb.
• Discontinuous dependencies cover the remaining cases where a constituent is split by an intervening phrase, mostly extraposed modifiers.
Not all occurrences of these phenomena result in a discontinuous tree (Evang, 2011). For example, a sentence containing both a fronted quotation and a subject-verb inversion will not result in a discontinuity. Conversely, some trees contain several occurrences of phenomena producing discontinuities. We release these annotations as a csv file, provided as supplementary material.
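The released annotations can be consumed programmatically. The sketch below assumes a simple layout with one (sentence id, phenomenon) pair per row; the column names (`sent_id`, `phenomenon`) and the example ids are hypothetical, since the exact schema of the csv file is not specified here:

```python
import csv
import io

def load_annotations(csv_text):
    """Parse the annotation file into {sentence_id: set of phenomena}.
    A sentence may carry several phenomenon annotations, hence the sets."""
    annotations = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        annotations.setdefault(row["sent_id"], set()).add(row["phenomenon"])
    return annotations

# Hypothetical file content, for illustration only:
example = (
    "sent_id,phenomenon\n"
    "dev.12,wh-extraction\n"
    "dev.12,it-extraposition\n"
    "dev.37,fronted quotation\n"
)
```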
Per-phenomenon evaluation method In order to obtain a per-phenomenon evaluation of the predictions of discontinuous parsers, we first extract individual evaluations for each discontinuous tree, as provided by the standard evaluator for discontinuous parsing (van Cranenburgh et al., 2016, discodop). These include the number of gold, correct, and incorrect discontinuous constituents, in both the labelled and unlabelled cases. We consider that the annotated target phenomenon is perfectly predicted if the sentence-level discontinuous F-score is 100, and partially predicted if it is > 0. As such, this evaluation is recall-oriented: we focus on how well the gold phenomena are predicted, but do not take into account false positives (which would require us to assign a phenomenon to predicted trees with incorrect discontinuous constituents).
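This recall-oriented aggregation can be sketched as follows. The data shape — a list of (annotated phenomena, sentence-level discontinuous F-score) pairs — is our assumption about how the discodop output and the annotations would be combined:

```python
from collections import defaultdict

def per_phenomenon_stats(sentences):
    """Aggregate recall-oriented statistics per annotated phenomenon.

    `sentences` is a list of (phenomena, disc_f) pairs, where `phenomena`
    is the set of categories annotated on the sentence and `disc_f` is the
    sentence-level discontinuous F-score (0-100).  A phenomenon counts as
    perfectly predicted when disc_f == 100 and as partially predicted
    when disc_f > 0 (so perfect predictions also count as partial)."""
    stats = defaultdict(lambda: {"total": 0, "perfect": 0, "partial": 0})
    for phenomena, disc_f in sentences:
        for p in phenomena:
            stats[p]["total"] += 1
            if disc_f == 100:
                stats[p]["perfect"] += 1
            if disc_f > 0:
                stats[p]["partial"] += 1
    return dict(stats)
```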

Error-Correction-Driven Analyser
We now focus on automatically classifying errors according to structural patterns. To do so, we build on Kummerfeld et al. (2012) and proceed in two steps: (i) finding a sequence of atomic tree modifications that transforms a predicted tree into the corresponding gold tree; (ii) classifying the steps in the transformation sequence according to predefined structural patterns. For step (i), we use a greedy search algorithm that first corrects errors on discontinuous nodes, and then backs off to Kummerfeld et al. (2012)'s method for projective error correction. In this section, we therefore focus on discontinuous error corrections.
We use 4 atomic tree modifications:
(i) change the label of a discontinuous node;
(ii) create a discontinuous node;
(iii) delete a discontinuous node;
(iv) move a node, resulting in a discontinuity.
(In the unlabelled case, we remove duplicate constituents, corresponding to unary rewrites, before evaluation, as they are not interpretable. Therefore, the labelled result may occasionally be higher than the corresponding unlabelled one.)
The creation of a discontinuous node (ii) consists in gathering several nodes sharing the same parent and attaching them as the children of a new node, which is in turn attached to their original parent. For example, in Figure 2, the parser missed a VP node. The correction consists in creating a discontinuous VP node with two children (the ADJP and VBD nodes) and attaching it to the SINV node.
To delete a discontinuous node (iii), we simply attach its children to their grandparent node. For example, in Figure 3, the children of the highlighted VP to be deleted (lower part) will be attached to the higher VP. For both node creation and node deletion, the corrected tree differs from the original predicted tree by a single node.
Finally, moving a node (iv) involves reattaching it to a different parent. For example, in Figure 4, the correction of the predicted tree consists in attaching the WHNP to the lowest VP, thereby recovering two missing discontinuous constituents (both VPs, see the gold tree). A side effect is that the move also produces a unary S constituent that must be deleted in a further correction step.
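The node-creation and node-deletion modifications can be illustrated on a toy tree representation. This is a minimal stand-in, not the analyser's actual data structures, and the SINV/ADJP/VBD example mirrors the Figure 2 discussion above:

```python
class Node:
    """Minimal constituent node; only labels and children are modelled."""
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

def create_node(parent, label, to_group):
    """(ii) Gather the children of `parent` listed in `to_group` under a
    new node with the given `label`, inserted where the first grouped
    child was."""
    new = Node(label, [c for c in parent.children if c in to_group])
    i = parent.children.index(new.children[0])
    parent.children = [c for c in parent.children if c not in to_group]
    parent.children.insert(i, new)
    return new

def delete_node(parent, node):
    """(iii) Delete `node` by splicing its children into its parent's
    child list, in place of the deleted node."""
    i = parent.children.index(node)
    parent.children[i:i + 1] = node.children
```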
In order to find a sequence of error-correcting modifications, we perform a greedy search. While there is a false positive or a false negative discontinuous constituent in the current tree, we try to apply actions (i-iv), in this order of priority. Modifications (i-iii) cannot introduce new errors, whereas moving a node (iv) may do so in some cases. We ensure that a node is only moved if the correction does not increase the total number of errors by more than 1. Once we have found the correction sequence, we classify errors according to the patterns defined by Kummerfeld et al. (2012), and rely on their tool to compute statistics about projective constituent mistakes.
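The greedy search can be sketched if we abstract a tree to its set of (label, yield) discontinuous constituents. This is a simplification of the actual algorithm, which operates on full trees and also includes the move action (iv):

```python
def disc_errors(pred, gold):
    """False positive + false negative discontinuous constituents, with
    trees abstracted to sets of (label, yield) pairs."""
    return len(pred ^ gold)

def correct_greedily(pred, gold):
    """Greedy error correction, in priority order: (i) relabel,
    (ii) create, (iii) delete, until pred matches gold."""
    steps = []
    while disc_errors(pred, gold) > 0:
        fp, fn = pred - gold, gold - pred
        # (i) relabel: a false positive sharing its yield with a false negative
        match = next(((p, g) for p in fp for g in fn if p[1] == g[1]), None)
        if match:
            p, g = match
            pred = (pred - {p}) | {g}
            steps.append("relabel")
        elif fn:                        # (ii) create a missing constituent
            pred = pred | {next(iter(fn))}
            steps.append("create")
        else:                           # (iii) delete a spurious constituent
            pred = pred - {next(iter(fp))}
            steps.append("delete")
    return steps
```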

Parser
We use a Python reimplementation of a previously described transition-based parser, augmented with a mechanism to integrate and fine-tune BERT (Devlin et al., 2019). We release our code with pretrained models for replication purposes. The parser is based on a simple transition system (ML-GAP) that features the GAP action (Coavoux and Crabbé, 2017) to construct discontinuous constituents, and separates structural and labelling actions (Cross and Huang, 2016). The scoring system has two submodules:
• a sentence encoder that constructs contextualized embeddings for each token and is run before parsing;
• a feed-forward network that predicts the next action from the contextualized embeddings of tokens extracted from specific positions in the parsing configuration.
In the remainder of the paper, ML-GAP denotes the baseline parser that only has access to the training corpus and has no pretrained parameters; ML-GAP+FT, the parser with access to fasttext pretrained embeddings (Mikolov et al., 2018b); and ML-GAP+BERT, the parser that uses the bert-base-cased pretrained language model to compute token representations and fine-tunes it.
Token and sentence encoder The parsers differ in the way they represent the tokens (w_1, w_2, ..., w_n) of a sentence. The ML-GAP parser computes character-based word embeddings with a character bi-LSTM, c_i = bi-LSTM(w_i), and concatenates them to word embeddings: ([c_1; w_1], ..., [c_n; w_n]). The ML-GAP+FT parser replaces the learned word embeddings by (frozen) fast-text embeddings. The ML-GAP+BERT parser also uses a character bi-LSTM, but its output is concatenated with the contextualized embeddings from BERT, i.e. the output of BERT's last layer for the corresponding tokens. When BERT segments a token into several subtokens, we use the vector corresponding to the first subtoken. Alternative methods are available (using the last subtoken or an aggregation of the subtoken vectors) but they do not seem to have an effect on parsing (Kitaev et al., 2019).
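The first-subtoken alignment can be sketched independently of any specific BERT implementation; the toy subtokenizer in the test merely stands in for BERT's wordpiece segmentation:

```python
def first_subtoken_indices(words, subtokenize):
    """For each word, the index (in the flattened subtoken sequence) of
    its first subtoken; the contextualized vector at that index serves as
    the word's representation.  `subtokenize` maps a word to its list of
    subtokens (a stand-in for BERT's wordpiece tokenizer)."""
    indices, offset = [], 0
    for w in words:
        indices.append(offset)
        offset += len(subtokenize(w))
    return indices
```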
Then, the token embeddings are fed to a bi-LSTM sentence encoder, as is usual in parsing (Stanojević and Alhama, 2017; Corro, 2020; Stanojević and Steedman, 2020). In preliminary experiments, we alternatively used a self-attentive encoder (Vaswani et al., 2017), as done successfully in recent work on projective constituency parsing (Kitaev and Klein, 2018; Kitaev et al., 2019). However, it proved hard to optimize (high variance across experiments) and did not obtain better results than a bi-LSTM.

Action scorer and features We use two distinct feed-forward networks to score, respectively, structural actions (SHIFT, MERGE, GAP) and labelling actions (NO-LABEL, {LABEL-X | X is a non-terminal}). They have an identical architecture and only differ in the number of units in the output layer. We use a single hidden layer with a tanh activation. We apply dropout to its input, and layer normalization (Ba et al., 2016) to the hidden layer. We use a softmax normalization to compute scores for the possible output labels.
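A minimal numpy sketch of such a scorer (inference only, so dropout is omitted; the dimensions and random parameters are placeholders, not trained weights):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def score_actions(x, W_h, b_h, W_o, b_o):
    """One-hidden-layer scorer as described in the text: tanh hidden
    layer, layer normalization on the hidden layer, softmax output."""
    h = layer_norm(np.tanh(W_h @ x + b_h))
    logits = W_o @ h + b_o
    e = np.exp(logits - logits.max())   # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.normal(size=20)                 # concatenated feature vectors
W_h, b_h = rng.normal(size=(32, 20)), np.zeros(32)
W_o, b_o = rng.normal(size=(3, 32)), np.zeros(3)   # e.g. SHIFT / MERGE / GAP
probs = score_actions(x, W_h, b_h, W_o, b_o)
```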
The choice of the structural or labelling classifier is entirely determined by the parsing configuration and depends on the type of the next action. The input to both classifiers is the concatenation of contextualized vectors extracted from a list of positions in the parsing configuration and specified as a list of feature templates.
In the ML-GAP transition system, a parsing configuration is defined by 3 data structures: a stack s containing subtrees, a double-ended queue d also containing subtrees, and a buffer b containing the yet unprocessed tokens. We use the following 11 templates:
• the left-most and right-most tokens of the first and second elements of s and d (8 templates in total: s0.l, s0.r, s1.l, s1.r, d0.l, d0.r, d1.l, d1.r);
• the next token in the buffer (b0);
• the contextualized embeddings corresponding to the start-of-sentence and end-of-sentence symbols.
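Extracting these 11 positions from a configuration can be sketched as follows. The encoding choices here are our assumptions: subtrees are (leftmost, rightmost) position pairs, missing elements yield None, and the boundary symbols are encoded as -1 and n:

```python
def feature_positions(stack, deque, buffer, n):
    """Token positions for the 11 templates: leftmost/rightmost tokens of
    s0, s1, d0, d1, the next buffer token b0, and the start/end-of-sentence
    symbols.  `stack` and `deque` hold (leftmost, rightmost) pairs with
    index 0 as the top/front; `buffer` holds token positions."""
    def lr(items, i):
        if i < len(items):
            l, r = items[i]
            return [l, r]
        return [None, None]          # padding for missing elements
    feats = lr(stack, 0) + lr(stack, 1) + lr(deque, 0) + lr(deque, 1)
    feats.append(buffer[0] if buffer else None)   # b0
    feats += [-1, n]                              # <s>, </s>
    return feats
```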
Overall results We report overall development and test results in Table 1 (see Appendix A for details about training), and compare them to published results on the DPTB dataset. For a more comprehensive evaluation of the parser, including results on the Tiger (Brants et al., 2002) and Negra (Skut et al., 1997) German corpora, we refer the reader to Table 7 of Appendix B. The ML-GAP setting improves over previously reported results by 0.4 and 3.1, respectively, on the F and DF metrics, which we attribute to the hyperparameter search. In both the supervised setting and the 'pretrained embeddings' setting, our parser's results lag behind those of Corro (2020), the current state of the art. However, it is noticeably more accurate on discontinuous constituents (more than 10 points of absolute DF difference).
In the BERT fine-tuning setting, the F measure of the ML-GAP+BERT model slightly outperforms the span-based parser of Corro (2020): the use of BERT seems to cancel the benefits of the exact decoding permitted by the span-based approach of Corro (2020). On the DF metric, the gap is even larger (13.6 absolute difference). We attribute this difference to the fact that Corro (2020)'s parser is restricted to a certain type of discontinuities and cannot construct certain trees. Moreover, the transition-based paradigm enables a parser to use more fine-grained features than span-based parsers, which is particularly helpful for predicting discontinuous constituents.

Results and Discussion
In this section, we focus on the comparison of our 3 models to assess the contribution of BERT to discontinuous parsing. We first focus on English, using the two resources we introduced in Sections 3 and 4. Then we provide and discuss results on German, using discosuite (Maier et al., 2014).

English
The improvements brought by BERT may come from its syntactic knowledge. However, they might also result from its extended lexical knowledge (providing more lexical information about out-of-vocabulary or rare words, which might be known but not take part in discontinuous structures in the training set). The ML-GAP+FT model provides a control setting, in which 'static' pretrained embeddings provide additional lexical information.
Overall effect of pretraining We provide detailed results (precision, recall, F) in Table 2. It has been reported that discontinuous parsers often exhibit a large gap between precision (higher) and recall (lower) on discontinuities, on both German and English corpora (Maier, 2015; Stanojević and Steedman, 2020; Corro, 2020). The use of BERT tends to fill this gap, with a much stronger effect on recall (+15 DR on the development set over ML-GAP) than on precision (+4 DP): BERT leads the parser to detect syntactic discontinuities better than a supervised model (ML-GAP) does. In contrast, ML-GAP+FT provides only a small improvement over ML-GAP (+2.2 dev DF), which is split almost equally between precision (+2.0 DP) and recall (+2.5 DR). The striking difference between ML-GAP+FT and ML-GAP+BERT strongly suggests that BERT's contribution cannot be reduced to its extended lexical knowledge.
Per-phenomenon evaluation We report results on the test suite in Table 3, in the labelled case (upper part) and the unlabelled case (lower part). For each metric, we report the result of the ML-GAP+BERT model, as well as its absolute difference with, respectively, the ML-GAP+FT and ML-GAP models.
First, when comparing ML-GAP+BERT and ML-GAP, we observe a large improvement on all phenomena and almost all metrics. When comparing labelled and unlabelled results, we observe very small differences (< 1), except in the case of circumpositioned quotations. This is due to some cases where the discontinuous quotation phrase has an infrequent label (FRAG or SINV). Overall, subject inversions, fronted quotations and circumpositioned quotations are almost perfectly detected by the ML-GAP+BERT system, with DF scores over 95 and high exact match (at least in the unlabelled case for circumpositioned quotations). On the other hand, discontinuous dependencies, it-extrapositions and, to a lesser extent, extractions have DF scores below 90, despite the large effect of BERT (respectively +35.2 and +25 absolute improvement on exact match for discontinuous dependencies and it-extrapositions). Secondly, the improvement brought by fast-text embeddings is consistently very small (around +2 F), except on 2 types of phenomena: it-extrapositions (where BERT does not improve over fast-text) and, to a lesser extent, discontinuous dependencies (+6 F for fast-text, +37.8 F for BERT). This result suggests that the difficulty of parsing these phenomena stemmed, at least partly, from a lack of lexical knowledge.

Error Analysis
We report results of the error type classifier in Table 4. For each error type, we report (i) the overall count of occurrences, (ii) the number of occurrences whose correction involved a discontinuous node, and (iii) the total number of nodes involved (a single error can cause multiple wrong nodes), as done by Kummerfeld et al. (2012). Note that we have a small sample for subject inversions (5 instances).
Overall, we observe a substantial decrease across all types of errors, with error-reduction rates often close to 40% for ML-GAP+BERT (e.g. 45% fewer occurrences of PP-attachment errors) and around 20% for ML-GAP+FT.
The picture is slightly different if we look at errors involving discontinuous constituents. Indeed, the use of BERT drastically reduces the main sources of errors (PP/VP/NP attachment, modifier attachment, coordination), while having no effect on other types of structure (NP internal structure, unary constituent, label). In contrast, fast-text only improves modifier and NP attachments, and even introduces clause attachment errors.

German
In order to provide additional context to our results on English, we further experiment with the same parsing models on German, using the test suite built by Maier et al. (2014) on the German Tiger corpus. They constructed this test suite by first identifying and classifying discontinuous phenomena in the first 1,500 sentences of the Tiger corpus, and then selecting 15 sentences for each identified phenomenon. In total, discosuite contains 180 occurrences across 151 sentences. Each occurrence corresponds to a single discontinuous constituent.
We train our parsers on a modified version of the SPMRL Tiger split, in which the 151 sentences are removed from the training set. We then parse these 151 sentences and use the labelled and unlabelled recall on target constituents to evaluate the corresponding phenomena. We provide results on the test suite in Table 5, using the same settings as for English (ML-GAP, ML-GAP+FT, ML-GAP+BERT). We refer the reader to Maier et al. (2014) for descriptions of the specific phenomena. To the best of our knowledge, no prior parsing work has used this test suite for evaluation since its release.
Due to a finer-grained classification, there are only a few instances of each type. Hence, we only comment on general patterns. Overall, fast-text provides small improvements on 6 types of phenomena (out of 14). In contrast, BERT improves on every type of phenomenon, with the largest increases for extrapositions of an element of a coordination, extrapositions involving a focus adverb (e.g. an adverb in the main clause modifying a subordinate clause), and local movement (which involves discontinuities that do not cross clause boundaries). These are also the most difficult phenomena to predict correctly (< 50 recall). We observe no large difference between labelled and unlabelled scores, suggesting that finding the correct structure is the main difficulty.

Table 5: Results of ML-GAP+BERT, ML-GAP+FT and ML-GAP on discosuite (Maier et al., 2014). The absolute difference with the ML-GAP model is indicated in parentheses.

Conclusion
We introduced two resources for fine-grained automatic error analysis of English discontinuous constituency parsers. First, we constructed and released a test suite covering the range of syntactic phenomena responsible for the discontinuous structures in the discontinuous version of the Penn Treebank. Second, we extended the Berkeley Parser Analyser to the analysis of discontinuous constituency trees. We applied these resources to study the contribution of BERT to discontinuous parsing of English.
Overall, on almost all phenomena, BERT brings an improvement over a fast-text baseline. We found that BERT leads to almost perfect detection of some phenomena (subject inversions, fronted quotations, circumpositioned quotations). However, there is still considerable room for improvement for extractions (despite the high frequency of this type of structure in the corpus), it-extrapositions, and discontinuous dependencies. In future work, we plan to address these limitations with targeted data-augmentation methods. We also plan to evaluate other pretrained language models to assess whether they exhibit the same error patterns as BERT.