Transition-based Bubble Parsing: Improvements on Coordination Structure Prediction

We propose a transition-based bubble parser to perform coordination structure identification and dependency-based syntactic analysis simultaneously. Bubble representations were proposed in the formal linguistics literature decades ago; they enhance dependency trees by encoding coordination boundaries and internal relationships within coordination structures explicitly. In this paper, we introduce a transition system and neural models for parsing these bubble-enhanced structures. Experimental results on the English Penn Treebank and the English GENIA corpus show that our parsers beat previous state-of-the-art approaches on the task of coordination structure prediction, especially for the subset of sentences with complex coordination structures.


Introduction
Coordination structures are prevalent in treebank data (Ficler and Goldberg, 2016a), especially in long sentences (Kurohashi and Nagao, 1994), and they are among the most challenging constructions for NLP models. Difficulties in correctly identifying coordination structures have consistently contributed a significant portion of the errors made by state-of-the-art parsers (Collins, 2003; Goldberg and Elhadad, 2010; Ficler and Goldberg, 2017). These errors can further propagate to downstream NLP modules and applications and limit their performance and utility. For example, Saha et al. (2017) report that missing conjuncts account for two-thirds of the recall errors made by their open information extraction system.
Coordination constructions are particularly challenging for the widely adopted dependency-based paradigm of syntactic analysis, since the asymmetric definition of head-modifier dependency relations is not directly compatible with the symmetric nature of the relations among the participating conjuncts and coordinators. (Code at github.com/tzshi/bubble-parser-acl21.) Existing treebanks usually resort to introducing special relations to represent coordination structures. But there remain theoretical and empirical challenges regarding how to most effectively encode information such as modifier-sharing relations while still permitting accurate statistical syntactic analysis.
In this paper, we explore Kahane's (1997) alternative solution: extend the dependency-tree representation by introducing bubble structures that explicitly encode coordination boundaries. The co-heads within a bubble enjoy a symmetric relationship, as befits a model of conjunction. Further, bubble trees support the representation of nested coordination, with the scope of shared modifiers identifiable by the attachment sites of bubble arcs. Figure 1 compares a bubble tree against a Universal Dependencies (UD; Nivre et al., 2016, 2020) tree for the same sentence.
Yet, despite these advantages, implementation of the formalism was not broadly pursued, for reasons unknown to us. Given its appealing and intuitive treatment of coordination phenomena, we revisit the bubble tree formalism, introducing and implementing a transition-based solution for parsing bubble trees. Our transition system, Bubble-Hybrid, extends the Arc-Hybrid transition system (Kuhlmann et al., 2011) with three bubble-specific transitions, corresponding to opening, expanding, and closing bubbles. We show that our transition system is both sound and complete with respect to projective bubble trees (defined in §2.2). Experiments on the English Penn Treebank (PTB; Marcus et al., 1993), extended with coordination annotations (Ficler and Goldberg, 2016a), and the English GENIA treebank (Kim et al., 2003) demonstrate the effectiveness of our proposed transition-based bubble parsing on the task of coordination structure prediction. Our method achieves state-of-the-art performance on both datasets and improves accuracy on the subset of sentences exhibiting complex coordination structures.

Dependency-based Representations for Coordination Structures
A dependency tree encodes syntactic relations via directed bilexical dependency edges. These are natural for representing argument and adjunct modification, but Popel et al. (2013) point out that "dependency representation is at a loss when it comes to representing paratactic linguistic phenomena such as coordination, whose nature is symmetric (two or more conjuncts play the same role), as opposed to the head-modifier asymmetry of dependencies" (pg. 517). If one nonetheless persists in using dependency relations to annotate all syntactic structures, as is common practice in most dependency treebanks (Hajič et al., 2001; Nivre et al., 2016, inter alia), then one must introduce special relations to represent coordination structures and promote one element from each coordinated phrase to become the "representational head". One choice is to designate one of the conjuncts as the "head" (Mel'čuk, 1988, 2003; Järvinen and Tapanainen, 1998; Lombardo and Lesmo, 1998); e.g., in Figure 1, the visually asymmetric "conj" relation between "coffee" and "tea" is overloaded to admit a symmetric relationship. But it is then non-trivial to distinguish shared modifiers from private ones; e.g., in the UD tree at the bottom of Figure 1, it is difficult to tell that "hot" is shared by "coffee" and "tea" but does not modify "bun". Another choice is to let one of the coordinators dominate the phrase (Hajič et al., 2001, 2020), but the coordinator does not directly capture the syntactic category of the coordinated phrase. Decisions about which of these dependency-based fixes is more workable are further complicated by the interaction between representation styles and their learnability in statistical parsing (Nilsson et al., 2006; Johansson and Nugues, 2007; Rehbein et al., 2017).
Enhanced UD A tactic used by many recent releases of UD treebanks is to introduce certain extra edges and non-lexical nodes (Nivre et al., 2018; Bouma et al., 2020). While some of the theoretical issues persist in this approach with respect to capturing the symmetric nature of relations between conjuncts, this solution better represents shared modifiers in coordinations, and so is a promising direction. In work concurrent with our own, Grünewald et al. (2021) manually correct the coordination-structure annotations in an English treebank under the enhanced UD representation format. We leave it to future work to explore the feasibility of automatically converting coordination-structure representations between enhanced UD trees and bubble trees, which we discuss next.

Bubble Trees
An alternative solution to the coordination-in-dependency-trees dilemma is to permit certain restricted phrase-inspired constructs for such structures. Indeed, Tesnière's (1959) seminal work on dependency grammar does not describe all syntactic relations in terms of dependencies, but rather reserves a primitive relation for connecting coordinated items. Hudson (1984) further extends this idea by introducing explicit markings of coordination boundaries.
In this paper, we revisit bubble trees, a representational device in the same vein, introduced by Kahane (1997) for syntactic representation. (Kahane credits Gladkij (1968) with a formal study.) Bubbles are used to denote coordinated phrases; otherwise, asymmetric dependency relations are retained. Conjuncts immediately within a bubble may co-head the bubble, and the bubble itself may establish dependencies with its governor and modifiers. Figure 1 depicts an example bubble tree.
We now formally define bubble trees and their projective subset, which will become the focus of our transition-based parser in §3. The following formal descriptions are adapted from Kahane (1997), tailored to the presentation of our parser.
Formal Definition Given a dependency-relation label set L, we define a bubble tree for a length-n sentence W = w_1, ..., w_n to be a quadruple (V, B, φ, A), where V = {RT, w_1, ..., w_n} is the ground set of nodes (RT is the dummy root), B is a set of bubbles, the function φ : B → (2^V \ {∅}) gives the content of each bubble as a non-empty subset of V, and A ⊂ B × L × B defines a labeled directed tree over B. Given the labeled directed tree A, we say α_1 → α_2 if and only if (α_1, l, α_2) ∈ A for some l. We denote the reflexive transitive closure of the relation → by →*.
Bubble tree (V, B, φ, A) is well-formed if and only if it satisfies the following conditions:
• No partial overlap: for any α_1, α_2 ∈ B, the contents φ(α_1) and φ(α_2) are either disjoint or one contains the other;
• Non-duplication: there exist no non-identical α_1, α_2 ∈ B such that φ(α_1) = φ(α_2);
• Lexical coverage: for any singleton (i.e., one-element) set s in 2^V, there exists α ∈ B such that φ(α) = s;
• Roothood: the root RT appears in exactly one bubble, a singleton that is the root of the tree defined by A.
Projectivity Our parser focuses on the subclass of projective well-formed bubble trees. Visually, a projective bubble tree only contains bubbles covering a consecutive sequence of words (such that we can draw boxes around the span of words to represent them) and can be drawn with all arcs arranged spatially above the sentence where no two arcs or bubble boundaries cross each other. The bubble tree in Figure 1 is projective.
Formally, we define the projection ψ(α) ∈ 2^V of a bubble α ∈ B to be all the nodes that the bubble and its subtree cover; that is, v ∈ ψ(α) if and only if α →* α′ and v ∈ φ(α′) for some α′. Then, we define a well-formed bubble tree to be projective if and only if it additionally satisfies the following:
• Continuous coverage: for any bubble α ∈ B, if w_i, w_j ∈ φ(α) and i < k < j, then w_k ∈ φ(α).
(Our definition does not allow empty nodes; we leave it to future work to support them for gapping constructions. We do not use β for bubbles because we reserve the β symbol for our parser's buffer.)
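The set-theoretic conditions above can be illustrated with a short sketch. This is our own toy check, not the authors' implementation: bubbles are represented as frozensets of node indices, with 0 standing for the dummy root RT, and the tree constraints on the arc set A are assumed to be enforced elsewhere.

```python
from itertools import combinations

def is_well_formed(n, bubbles):
    """Check the set-theoretic well-formedness conditions on a bubble set.

    `n` is the sentence length; `bubbles` is a collection of frozensets of
    node indices (0 stands for the dummy root RT).  The tree constraints on
    the arc set A are assumed to be checked elsewhere.
    """
    bubbles = set(bubbles)  # non-duplication is implied by using a set
    # Lexical coverage: every singleton {RT}, {w_1}, ..., {w_n} is a bubble.
    if any(frozenset({i}) not in bubbles for i in range(n + 1)):
        return False
    # Roothood: RT (index 0) appears only in its own singleton bubble.
    if any(0 in b and b != frozenset({0}) for b in bubbles):
        return False
    # No partial overlap: any two bubbles are disjoint or nested.
    for b1, b2 in combinations(bubbles, 2):
        inter = b1 & b2
        if inter and inter != b1 and inter != b2:
            return False
    # Continuous coverage (projectivity): each bubble is a contiguous span.
    for b in bubbles:
        words = sorted(i for i in b if i > 0)
        if words and words != list(range(words[0], words[-1] + 1)):
            return False
    return True
```

For a 3-word sentence, the singleton bubbles plus a bubble over words 2-3 pass, while a bubble over the non-contiguous {1, 3} or a pair of partially overlapping bubbles fail.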

Our Transition System for Parsing Bubble Trees
Although, as we have seen, bubble trees have theoretical benefits in representing coordination structures that interface with an overall dependency-based analysis, there has been a lack of parser implementations capable of handling such representations. In this section, we fill this gap by introducing a transition system that incrementally builds projective bubble trees. Transition-based approaches are popular in dependency parsing (Nivre, 2008; Kübler et al., 2009). We propose to extend the Arc-Hybrid transition system (Kuhlmann et al., 2011) with transitions specific to bubble structures.

Bubble-Hybrid Transition System
A transition system consists of a data structure describing the intermediate parser states, called configurations; specifications of the initial and terminal configurations; and an inventory of transitions that advance the parser in configuration space towards reaching a terminal configuration.
Our transition system uses a configuration data structure similar to that of Arc-Hybrid, consisting of a stack, a buffer, and the partially-committed syntactic analysis. Initially, the stack contains only a singleton bubble corresponding to {RT}, and the buffer contains singleton bubbles, each representing a token in the sentence. Then, by taking transitions one at a time, the parser can incrementally move items from the buffer to the stack, or reduce items by attaching them to other bubbles or merging them into larger bubbles. Eventually, the parser should arrive at a terminal configuration where the stack again contains only the singleton bubble of {RT} and the buffer is empty, as all the tokens are now attached to or contained in other bubbles that are now descendants of the {RT} singleton, and we can retrieve a completed bubble-tree parse.

(Figure 2 caption: Step-by-step visualization of the stack and buffer during parsing of the example sentence in Figure 1. For steps following an attachment or BUBBLECLOSE transition, the detailed subtree or internal bubble structure is omitted for visual clarity. For the same reason, we omit drawing the boundaries around singleton bubbles.)

Table 1 lists the available transitions in our Bubble-Hybrid system. The SHIFT, LEFTARC, and RIGHTARC transitions are as in the Arc-Hybrid system. We introduce three new transitions to handle coordination-related bubbles: BUBBLEOPEN puts the first two items on the stack into an open bubble, with the first item in the bubble (previously the second topmost item on the stack) labeled as the first conjunct of the resulting bubble; BUBBLEATTACH absorbs the topmost item on the stack into the open bubble at the second topmost position; and finally, BUBBLECLOSE closes the open bubble at the top of the stack and moves it to the buffer, which then allows it to take modifiers from its left through LEFTARC transitions. Figure 2 visualizes the stack and buffer throughout the process of parsing the example sentence in Figure 1. In particular, the last two steps in the left column of Figure 2 show the bubble corresponding to the phrase "coffee or tea" receiving its left modifier "hot" through a LEFTARC transition after it is put back on the buffer by a BUBBLECLOSE transition.
Formal Definition Our transition system is a quadruple (C, T, c_i, C_τ), where C is the set of configurations to be defined shortly, T is the set of transitions, each a partial function t : C ⇀ C, c_i maps a sentence to its initial configuration, and C_τ ⊂ C is the set of terminal configurations. Each configuration c ∈ C is a septuple (σ, β, V, B, φ, A, O), where V, B, φ, and A define a partially-recognized bubble tree, σ and β are each an (ordered) list of items in B (the stack and the buffer, respectively), and O ⊂ B is the set of open bubbles. For a sentence W = w_1, ..., w_n, we write σ|s_1 and b_1|β to denote a stack and a buffer whose topmost items are s_1 and b_1 and whose remainders are σ and β, respectively. We also omit the constant V in describing c when the context is clear.
Table 1 gives the formal definitions of the transitions in T. (For BUBBLEOPEN, the new content function φ′ is almost the same as φ, but with the new bubble α added to the function's domain, mapped by the new function to cover the projections of both s_2 and s_1.)
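The transition semantics described above can be sketched operationally. This is a simplified, unlabeled toy rendering, not the authors' implementation: bubbles are represented as tuples of covered token indices, and pre-condition checks and first-conjunct bookkeeping are omitted for brevity.

```python
class Config:
    """A minimal parser configuration: stack, buffer, arcs, and open bubbles."""
    def __init__(self, n):
        self.stack = [(0,)]                        # singleton bubble {RT}
        self.buffer = [(i,) for i in range(1, n + 1)]
        self.arcs = set()                          # (head_bubble, dep_bubble)
        self.open = set()                          # currently open bubbles

def shift(c):          # move b1 from the buffer to the stack
    c.stack.append(c.buffer.pop(0))

def left_arc(c):       # s1 becomes a dependent of b1
    c.arcs.add((c.buffer[0], c.stack.pop()))

def right_arc(c):      # s1 becomes a dependent of s2
    s1 = c.stack.pop()
    c.arcs.add((c.stack[-1], s1))

def bubble_open(c):    # wrap s2 and s1 into a new open bubble
    s1, s2 = c.stack.pop(), c.stack.pop()
    alpha = s2 + s1                                # covered span; s2 is the
    c.open.add(alpha)                              # first conjunct (not tracked)
    c.stack.append(alpha)

def bubble_attach(c):  # absorb s1 into the open bubble s2
    s1, s2 = c.stack.pop(), c.stack.pop()
    assert s2 in c.open
    c.open.remove(s2)
    alpha = s2 + s1
    c.open.add(alpha)
    c.stack.append(alpha)

def bubble_close(c):   # close the open bubble s1 and move it to the buffer
    s1 = c.stack.pop()
    c.open.remove(s1)
    c.buffer.insert(0, s1)
```

For a 3-token sentence whose last two tokens coordinate and whose first token modifies the coordination (as in "hot coffee-or-tea" style examples), the sequence SHIFT, SHIFT, SHIFT, BUBBLEOPEN, BUBBLECLOSE, LEFTARC, SHIFT, RIGHTARC reaches a terminal configuration with only {RT} on the stack.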

Soundness and Completeness
In this section, we show that our Bubble-Hybrid transition system is both sound and complete (defined below) with respect to the subclass of projective bubble trees. Define a valid transition sequence π = t_1, ..., t_m for a given sentence W to be a sequence such that, for the corresponding sequence of configurations c_0, ..., c_m, we have c_0 = c_i(W), c_j = t_j(c_{j-1}) for 1 ≤ j ≤ m, and c_m ∈ C_τ. We can then state the soundness and completeness properties, and present high-level proof sketches below, adapted from Nivre's (2008) proof frameworks.
Lemma 1. (Soundness) Every valid transition sequence π produces a projective bubble tree.
Proof Sketch. We examine the requirements for a projective bubble tree one by one. The set of edges satisfies the tree constraints, since every bubble except for the singleton bubble of RT must have an in-degree of one to have been reduced from the stack, and the topological order of reductions implies acyclicity. Lexical coverage is guaranteed by c_i. Roothood is safeguarded by the transition pre-conditions. Non-duplication is ensured because newly-created bubbles are strictly larger. All the other properties can be proved by induction over the lengths of transition-sequence prefixes, since each of our transitions preserves the no-partial-overlap, containment, and projectivity constraints.
Lemma 2. (Completeness) For every projective bubble tree over any given sentence W, there exists a corresponding valid transition sequence π.
Proof Sketch. The proof proceeds by strong induction on sentence length. We omit relation labels without loss of generality. The base case of |W| = 1 is trivial. For the inductive step, we enumerate the ways to decompose the tree's top-level structure. (1) When the root has multiple children: due to projectivity, each child bubble tree τ_i covers a consecutive span of words w_{x_i}, ..., w_{y_i} that is shorter than |W|. By the induction hypothesis, there exists a valid transition sequence π_i that constructs the child tree over RT, w_{x_i}, ..., w_{y_i}. Here we let π_i denote the transition sequence excluding the always-present final RIGHTARC transition that attaches the subtree to RT; this makes explicit which transitions to take once the subtrees are constructed. The full tree can be constructed by π = π_1, RIGHTARC, π_2, RIGHTARC, ... (expanding each π_i sequence into its component transitions), where we simply attach each subtree to RT immediately after it is constructed. (2) When the root has a single child bubble α, we cannot directly use the induction hypothesis, since α covers the same number of words as W. Thus we need to further enumerate the top-level structure of α. (2a) If α has children with projections outside of φ(α), then we can find a sequence π_0 for constructing the shorter-length bubble α and placing it on the buffer (this corresponds to an empty transition sequence if α is a singleton; otherwise, π_0 ends with a BUBBLECLOSE transition) and sequences π_i for α's outside children; say α has l children to the left of its contents. We construct the entire tree via π = π_1, ..., π_l, π_0, LEFTARC, ..., LEFTARC, SHIFT, π_{l+1}, RIGHTARC, ..., RIGHTARC, where we first construct all the left outside children and leave them on the stack, next build the bubble α and use LEFTARC transitions to attach its left children while it is on the buffer, then shift α to the stack before finally continuing to build its right children subtrees, each immediately followed by a RIGHTARC transition. (2b) If α is a non-singleton bubble without any outside children, but each of its inside children can be parsed through some π_i by the induction hypothesis, then we can define π = π_1, π_2, BUBBLEOPEN, π_3, BUBBLEATTACH, ..., BUBBLECLOSE, SHIFT, RIGHTARC, where we use a BUBBLEOPEN transition once the first two bubble-internal children are built, each subsequent child is attached via BUBBLEATTACH immediately after construction, and the final three transitions ensure proper closing of the bubble and its attachment to RT.

Models
Our model architecture largely follows Kiperwasser and Goldberg's (2016) neural Arc-Hybrid parser, but we additionally introduce feature composition for non-singleton bubbles, and a rescoring module to reduce frequent coordination-boundary prediction errors. Our model has five components: feature extraction, bubble-feature composition, transition scoring, label scoring, and boundary subtree rescoring.

Feature Extraction
We first extract contextualized features for each token using a bidirectional LSTM (Graves and Schmidhuber, 2005), where the inputs to the bi-LSTM are concatenations of word embeddings, POS-tag embeddings, and character-level LSTM embeddings. We also report experiments replacing the bi-LSTM with pre-trained BERT features (Devlin et al., 2019).

Bubble-Feature Composition
We initialize the features for each singleton bubble B_i in the initial configuration to be v_{B_i} = w_i. For a non-singleton bubble α, we use recursively composed features computed by a composition function g that combines the features of the co-heads (conjuncts) immediately inside the bubble. For our model, for any set of vectors V = {v_{i_1}, ..., v_{i_N}}, we set g(V) to be a learnable square matrix W_g applied to the element-wise average mean(V). We also experiment with a parameter-free version: g = mean.
Neither of the feature functions distinguishes between open and closed bubbles, so we append to each v vector an indicator-feature embedding based on whether the bubble is open, closed, or singleton.
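The two composition variants can be sketched as follows. This is our own illustrative stand-in, not the released model code: the matrix W_g is randomly initialized here rather than learned, and any non-linearity used in the actual model is omitted.

```python
import random

d = 8                                            # toy feature dimension
random.seed(0)
# stand-in for the learnable square matrix W_g (randomly initialized here)
W_g = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]

def mean(vectors):
    """Element-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(d)]

def g(vectors, parameterized=True):
    """Compose an unordered set of conjunct vectors into one bubble feature:
    element-wise mean, optionally followed by the square matrix W_g."""
    m = mean(vectors)
    if not parameterized:                        # the parameter-free variant
        return m
    return [sum(W_g[r][k] * m[k] for k in range(d)) for r in range(d)]
```

Because g takes an unordered set, conjunct order does not affect the composed feature, matching the symmetric treatment of co-heads; the function is applied recursively for nested bubbles.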
(We adopt the convenient abuse of notation of allowing indexing by arbitrary objects. Compared with the subtree-feature composition functions in dependency parsing, which are motivated by asymmetric headed constructions (Dyer et al., 2015; de Lhoneux et al., 2019; Basirat and Nivre, 2021), our definition focuses on composing features from an unordered set of vectors representing the conjuncts in a bubble; the composition function is applied recursively when there are nested bubbles.)

Transition Scoring Given the current parser configuration c, the model predicts the best unlabeled transition to take among all valid transitions valid(c), i.e., those whose pre-conditions are satisfied. We model the log-linear probability of taking an action with a multi-layer perceptron (MLP) over the concatenation of the feature vectors of s_1 through s_3, the first through third topmost items on the stack, and b_1, the immediately accessible buffer item. We experiment with varying the number of stack items to extract features from.
Label Scoring We separate edge-label prediction from (unlabeled) transition prediction, but the scoring function takes a similar form: an MLP predicts the label from features of the head and dependent of the edge to be added into the partial bubble tree in t(c).
Boundary Subtree Rescoring In our preliminary error analysis, we found that our models tend to make more mistakes at the boundaries of full coordination phrases than at the internal conjunct boundaries, due to incorrect attachments of children that must choose between the phrasal bubble and the first/last conjunct. For example, our initial model predicts "if you [owned it] and [liked it Friday]" instead of the annotated "if you [owned it] and [liked it]" (predicted and gold conjuncts bracketed; the original text marks them with italics and underlining), incorrectly attaching "Friday" to "liked". We attribute this problem to the greedy nature of our first formulation of the parser, and propose to mitigate the issue through rescoring. To rescore the boundary attachments of a non-singleton bubble α, for each left dependent d of α and its first conjunct α_f, we (re-)decide the attachment by scoring the two candidate attachment sites with an MLP over their features, and similarly for the last conjunct α_l and each potential right dependent.
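The rescoring decision can be sketched as a simple argmax over the two candidate attachment sites. This is a hypothetical illustration: `score(d, site)` stands in for the rescoring MLP described in the text, and the site names are our own labels.

```python
def rescore_boundary(left_deps, right_deps, score):
    """Re-decide boundary attachments of a non-singleton bubble.

    Each left dependent d chooses between attaching to the phrasal bubble
    itself ('bubble') or to its first conjunct ('first_conjunct'); each
    right dependent chooses between the bubble and the last conjunct.
    `score(d, site)` is a stand-in for the rescoring MLP in the text.
    """
    decisions = {}
    for d in left_deps:
        decisions[d] = max(("bubble", "first_conjunct"),
                           key=lambda site: score(d, site))
    for d in right_deps:
        decisions[d] = max(("bubble", "last_conjunct"),
                           key=lambda site: score(d, site))
    return decisions
```

In the running example, a right dependent like "Friday" would be re-attached outside the last conjunct whenever the bubble-level attachment scores higher.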
Training and Inference Our parser is a locally-trained greedy parser. In training, we optimize the model parameters to maximize the log-likelihoods of predicting the target transitions and labels along the paths generating the gold bubble trees, and the log-likelihoods of the correct attachments in rescoring. (We leave the definition of dynamic oracles (Goldberg and Nivre, 2013) for bubble tree parsing to future work.) During inference, the parser greedily commits to the highest-scoring transition and label at each of its parser configurations, and after reaching a terminal configuration, it rescores and re-adjusts all boundary subtree attachments.
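The greedy decoding loop described above has a very small core, sketched here under our own interface assumptions: `valid(c)` returns the applicable transition names (empty at a terminal configuration), `score(c, t)` stands in for the transition-scoring MLP, and `transitions` maps names to functions that mutate the configuration.

```python
def greedy_parse(config, valid, score, transitions):
    """Greedy decoding: repeatedly take the highest-scoring valid transition
    until a terminal configuration (no valid transitions) is reached."""
    while valid(config):
        best = max(valid(config), key=lambda t: score(config, t))
        transitions[best](config)   # mutate the configuration in place
    return config
```

Boundary rescoring would then run once on the returned terminal configuration, as described above.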

Empirical Results
Task and Evaluation We validate the utility of our transition-based parser on the task of coordination structure prediction. Given an input sentence, the task is to identify all coordination structures and the spans of all their conjuncts within that sentence. We mainly evaluate based on exact metrics, which count a prediction of a coordination structure as correct if and only if all of its conjunct spans are correct. To facilitate comparison with pre-existing systems that do not attempt to identify all conjunct boundaries, following Teranishi et al. (2017, 2019), we also consider inner (= only consider the correctness of the two conjuncts adjacent to the coordinator) and whole (= only consider the boundary of the whole coordinated phrase) metrics.
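As a concrete illustration, the exact metric can be sketched as follows. This is a hypothetical implementation, not the paper's evaluation script: we encode each coordination as a set of (start, end) conjunct spans and compute F1 over exact matches.

```python
def exact_f1(gold, pred):
    """Exact coordination metric: a predicted coordination is correct iff its
    full set of conjunct spans matches a gold coordination exactly.

    `gold` and `pred` are lists of coordinations, each a set of
    (start, end) conjunct spans.
    """
    gold, pred = set(map(frozenset, gold)), set(map(frozenset, pred))
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0
```

The inner and whole metrics would relax the comparison to the coordinator-adjacent conjuncts or the full phrase boundary, respectively.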

Data and Experimental Setup
We experiment with two English datasets: the Penn Treebank (PTB; Marcus et al., 1993; newswire) with added coordination annotations (Ficler and Goldberg, 2016a) and the GENIA treebank (Kim et al., 2003; research abstracts). We use the conversion tool distributed with the Stanford Parser to extract UD trees from the PTB-style phrase-structure annotations, which we then merge with the coordination annotations to form bubble trees. We follow prior work in reporting PTB results on its standard splits and GENIA results using 5-fold cross-validation. During training (but not testing), we discard all non-projective sentences. See Appendix A for dataset pre-processing and statistics and Appendix B for implementation details.
Baseline Systems We compare our models with several baseline systems. Hara et al. (2009; HSOM09) use edit graphs to explicitly align coordinated conjuncts, based on the idea that they are usually similar; Ficler and Goldberg (2016b; FG16) score candidate coordinations extracted from a phrase-structure parser by modeling their symmetry and replaceability properties; Teranishi et al. (2017; TSM17) directly predict the boundaries of coordinated phrases and then split them into conjuncts; Teranishi et al. (2019; TSM19) use separate neural models to score the inner and outer boundaries of conjuncts relative to the coordinators, and then use a chart parser to find the globally-optimal coordination structures. (We affirm that, as is best practice, only two test-set/cross-validation-suite runs occurred, one with BERT and one without, happening after we fixed everything else; that is, no other models were tried after seeing the first test-set/cross-validation results with and without BERT.)

Table 2 and Table 3 show the main evaluation results on the PTB and GENIA datasets. Our models surpass all prior results on both datasets. While the BERT improvements may not seem surprising, we note that Teranishi et al. (2019) report that their pre-trained language models (specifically, static ELMo embeddings) fail to improve their model's performance.

General Parsing Results
We also evaluate our models on standard parsing metrics by converting the predicted bubble trees to UD-style dependency trees. On PTB, our parsers reach unlabeled and labeled attachment scores (UAS/LAS) of 95.81/94.46 with BERT and 94.49/92.88 with the bi-LSTM, which are similar to the scores of prior transition-based parsers equipped with similar feature extractors (Kiperwasser and Goldberg, 2016; Mohammadshahi and Henderson, 2020). (These results are not strictly comparable with previous PTB evaluations, which mostly focus on non-UD dependency conversions; Table 4 makes a self-contained comparison using the same UD-based and coordination-merged data conversions. For TSM17, we report results for the extended model as described by Teranishi et al. (2019).) Table 4 compares the general parsing results of our bubble parser and an edge-factored graph-based dependency parser based on Dozat and Manning's (2017) parser architecture, using the same feature encoder as our parser and trained on the same data. Our bubble parser shows a slight improvement in identifying "conj" relations, despite having a lower overall accuracy due to the greedy nature of our transition-based decoder. Additionally, our bubble parser simultaneously predicts the boundaries of each coordinated phrase and conjunct, while a typical dependency parser cannot produce such structures.

Table 5 shows the results of our models with alternative bubble-feature composition functions and varying feature-set sizes. We find that the parameterized form of the composition function g performs better, and that the F1 scores mostly degrade as we use fewer features from the stack. Interestingly, the importance of our rescoring module becomes more prominent when we use fewer features. Our results resonate with Shi et al.'s (2017) findings on Arc-Hybrid that we need at least one stack item, but not necessarily two.
Table 6 shows that our model performs better than previous methods on complex sentences containing multiple coordination structures and/or coordinations with more than two conjuncts, especially when we use BERT as the feature extractor.

Related Work
Coordination Structure Prediction Very early work with heuristic, non-learning-based approaches (Agarwal and Boggess, 1992; Kurohashi and Nagao, 1994) typically reports difficulties in distinguishing shared modifiers from private ones, although such heuristics have recently been incorporated into unsupervised work (Sawada et al., 2020). Generally, researchers have focused on symmetry principles, seeking to align conjuncts (Kurohashi and Nagao, 1994; Shimbo and Hara, 2007; Hara et al., 2009; Hanamoto et al., 2012), since coordinated conjuncts tend to be semantically and syntactically similar (Hogan, 2007), as attested to by psycholinguistic evidence of structural parallelism (Frazier et al., 1984, 2000; Dubey et al., 2005). Ficler and Goldberg (2016a) and Teranishi et al. (2017) additionally leverage the linguistic principle of replaceability: one can typically replace a coordinated phrase with one of its conjuncts without the sentence becoming incoherent. This idea has led to improved open information extraction (Saha and Mausam, 2018). Using these principles may further improve our parser.

Coordination in Constituency Grammar
While our paper mainly focuses on enhancing dependency-based syntactic analysis with coordination structures, coordination is a well-studied topic in constituency-based syntax (Zhang, 2009), including proposals and treatments under lexical functional grammar (Kaplan and Maxwell III, 1988), tree-adjoining grammar (Sarkar and Joshi, 1996; Han and Sarkar, 2017), and combinatory categorial grammar (Steedman, 1996, 2000).

Tesnière Dependency Structure Sangati and Mazza (2009) propose a representation that is faithful to Tesnière's (1959) original framework. Similar to bubble trees, their structures pay special attention to coordination structures, respecting conjunct symmetry, but they also include constructs to handle other syntactic notions currently beyond our parser's scope. Such representations have been used for re-ranking (Sangati, 2010), but not for (direct) parsing. Perhaps our work can inspire a future Tesnière Dependency Structure parser.
Non-constituent Coordination Seemingly incomplete (non-constituent) conjuncts are particularly challenging (Milward, 1994), and our bubble parser currently has no special mechanism for them. Dependency-based analyses have adapted by extending to a graph structure (Gerdes and Kahane, 2015) or explicitly representing elided elements (Schuster et al., 2017). It may be straightforward to integrate the latter into our parser, à la Kahane's (1997) proposal of phonologically-empty bubbles.

Conclusion
We revisit Kahane's (1997) bubble tree representation, which explicitly encodes coordination boundaries, as a viable alternative to existing mechanisms in the dependency-based analysis of coordination structures. We introduce a transition system that is both sound and complete with respect to the subclass of projective bubble trees. Empirically, our bubble parsers achieve state-of-the-art results on the task of coordination structure prediction on two English datasets. Future work may extend the research scope to other languages, and to graph-based and non-projective parsing methods.

... 800 steps for GENIA, based on their training set sizes), we perform an evaluation on the dev set. If the dev-set performance fails to improve within 5 consecutive evaluation rounds, we multiply the learning rate by 0.1. We terminate model training when the learning rate has dropped three times, and select the best model checkpoint based on dev-set F1 scores according to the "exact" metric. For the BERT feature extractor, we fine-tune the pre-trained case-sensitive BERT base model through the transformers package. For the non-BERT model, we use pre-trained GloVe embeddings (Pennington et al., 2014). Following prior practice, we embed gold POS tags as input features when using the bi-LSTM for the models trained on the GENIA dataset, but we omit the POS-tag embeddings for the PTB dataset.
The training process for each model takes roughly 10 hours on an RTX 2080 Ti GPU; model inference speed is 41.9 sentences per second. We select our hyperparameters by hand. Due to computational constraints, our hand-tuning has been limited to setting the dropout rates; from the candidate set of {0.0, 0.1, 0.3, 0.5} we chose