Improving the Extraction of Supertags for Constituency Parsing with Linear Context-Free Rewriting Systems



Introduction
Discontinuous constituency parsing deals with the task of finding hierarchies of (possibly noncontiguous) phrases (constituents) in a given sentence and assigning a label (constituent symbol) to each phrase. Traditional approaches use grammar formalisms such as linear context-free rewriting systems (LCFRS) to model these hierarchies (Maier and Søgaard, 2008; Kallmeyer and Maier, 2013; van Cranenburgh et al., 2016; Gebhardt, 2020). Statistical parsing with these grammars is remarkably slow and inaccurate by today's standards, but they still find some attraction because both the grammars and parsing with them are easily interpretable. More recent parsers use neural classifiers and either reduce the parsing process to a linear task (Coavoux, 2021; Fernández-González and Gómez-Rodríguez, 2021a,b, 2022) or score constituent labels for selected phrases (Corro, 2020; Stanojević and Steedman, 2020). However, in the latter approaches, the admissible discontinuities are restricted to a small degree.
In supertagging-based parsing (Bangalore and Joshi, 1999), a grammar is accompanied by a classifier that selects and scores a small sample of rules. After that, these rules and their scores are interpreted as a weighted grammar and used for statistical parsing in the usual manner. This remaining statistical parsing process is significantly faster than the traditional approach, as the grammar is much smaller. The approach was investigated in combination with tree adjoining grammars (TAG; Kasai et al., 2017; Bladier et al., 2018) and combinatory categorial grammars (CCG; Clark, 2002; Kadari et al., 2018), often in the context of dependency parsing. Since the introduction of recurrent neural networks (RNN) as classifiers, the accuracy of supertag prediction has improved substantially (Vaswani et al., 2016; Kasai et al., 2017; Bladier et al., 2018; Kadari et al., 2018). A recent publication (Ruprecht and Mörbitz, 2021) showed that supertagging improves the quality and speed of parsing constituency structures with LCFRS, bringing it close to recent discontinuous parsing methods. However, their extraction process for supertags is rather convoluted and uses hard-wired strategies for, e.g., the lexicalization and binarization. In this work, we investigate whether the quality of predictions and parsing with LCFRS supertags can be improved by introducing parameters that replace these hard-wired configurations in the extraction.
Section 3 presents a formulation of supertags and an extraction algorithm that is, in our eyes, easier to grasp than the previous definition, because it avoids cumbersome transformations of LCFRS derivations.

Figure 1: Discontinuous constituent tree for the phrase where the survey was carried out. The tree is illustrated with crossing branches, so that the leaves appear ordered. For each constituent, the path to its lexical head is double-struck.

Both tackle the following limitations of the existing approach: (i) Before the extraction of grammar rules, the constituent trees were binarized using specific fixed parameters. We will investigate the impact of varying those parameters on the extraction and parsing processes. (ii) The construction of lexical LCFRS rules picked a sentence position for each inner node of the constituent tree according to a fixed strategy. We will investigate multiple such strategies, which we call guide constructors. (iii) LCFRS rules were constructed with constituent symbols as nonterminals, which were then supplemented with annotations during the lexicalization process. The authors noted that the sets of extracted supertags are rather large compared to other approaches, and we deem that the granularity of the nonterminals plays a significant role in this issue. We decouple the nonterminals from the other extraction processes and introduce multiple strategies, called nonterminal constructors, to define them. Section 4 describes experiments with the discontinuous English Penn Treebank (DPTB; Marcus et al., 1994; Evang and Kallmeyer, 2011) and the two German treebanks NeGra (Skut et al., 1998) and Tiger (Brants et al., 2004). It explains how we found viable configurations for the introduced parameters and gives results for parsing with them. The implementation is available as free software.

Notation
A discontinuous constituent tree is a tuple (ξ, pos, w) as follows: w is a sequence of terminal symbols (phrase), pos is a sequence of part-of-speech (pos) symbols with the same length as w, and the constituent structure ξ is a tree; its inner nodes are constituent symbols and its leaves are the positions 0, . . ., |w| − 1 such that each occurs exactly once in ξ. Figure 1 shows an example.

Figure 2: A binary lexical LCFRS derivation for the string where the study was carried out.

We use the usual notation for (Gorn) positions in the constituent structure, i.e. each position determines exactly one node in ξ. The set of all inner node positions in ξ is denoted by npos(ξ). The subtree of ξ at position ρ is denoted by ξ|ρ. The yield yd(ξ) is the set of leaves in ξ. The fanout of a set of leaves L, denoted fo(L), is the smallest number of contiguous subsets of L. For instance, in Fig. 1, the yield of the subtree governed by the upper node labeled VP is the set {0, 3, 4, 5}, and its fanout is 2. ξ(ρ) denotes the constituent symbol at position ρ.
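The fanout of a set of leaves can be computed by counting the maximal runs of consecutive positions in it. A minimal sketch (the function name is ours, not from the paper's implementation):

```python
def fanout(leaves):
    """Smallest number of contiguous subsets covering `leaves`."""
    positions = sorted(leaves)
    runs = 0
    previous = None
    for p in positions:
        if previous is None or p != previous + 1:
            runs += 1  # a new contiguous block starts here
        previous = p
    return runs

# Yield of the upper VP node in Fig. 1: {0, 3, 4, 5} splits into
# the contiguous blocks {0} and {3, 4, 5}.
print(fanout({0, 3, 4, 5}))  # 2
```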
We sometimes consider lexical heads for the constituent structure. For each inner node, the lexical head is the critical sentence position that determines its symbol; and we call each of the node's children that does not contain the lexical head a modifier. In Fig. 1, for each inner node in ξ, the path to its lexical head is double-struck. E.g., for the inner node with symbol S, the lexical head is the position 3. The subtree starting at the NP node is its only modifier.
We briefly cover the notation for binary lexical LCFRS (in the following just LCFRS). Each rule is either of the form A → [w] (nullary, with lexical symbol w) or of the form A → [u1, . . ., uk]( ⃗B) with one or two rhs nonterminals in ⃗B (unary or binary); each component u1, . . ., uk is assembled as a string from the lexical symbol and n (respectively n + m in the binary case) argument strings contributed by the successors.
An LCFRS derivation is a tree where each node is a rule such that its number of successors matches the rule's arity and its rhs nonterminals match the successors' lhs nonterminals; Fig. 2 shows an example. The strings produced by a derivation are obtained recursively for each node from the bottom to the top as follows:
• If the node is a nullary rule A → [w] and has no successors, then the produced string is w.
• If the node is either a unary or binary rule of the form A → [u1, . . ., uk]( ⃗B) with | ⃗B| successors, then it produces k strings, which are obtained from u1, . . ., uk by replacing each variable xi with the i-th string produced by the first successor and each yj with the j-th string produced by the second successor.
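The recursion above can be sketched in a few lines. The representation is ours: a derivation node is a pair of a rule (a list of components) and its successors, and each component entry is either a terminal or a variable ("x", i) / ("y", j) referring to the i-th (j-th) string of the first (second) successor.

```python
def produce(derivation):
    """Compute the strings produced by a derivation bottom-up."""
    rule, successors = derivation
    produced = [produce(s) for s in successors]
    strings = []
    for component in rule:
        parts = []
        for token in component:
            if isinstance(token, tuple):
                axis, index = token  # "x" -> first, "y" -> second successor
                parts.append(produced[0 if axis == "x" else 1][index])
            else:
                parts.append(token)  # the lexical symbol
        strings.append(" ".join(parts))
    return strings

# A -> [x1 carried y1] applied to B -> [was] and C -> [out]
d = ([[("x", 0), "carried", ("y", 0)]],
     [([["was"]], []), ([["out"]], [])])
print(produce(d))  # ['was carried out']
```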

Contributions
This section presents all concepts involved in the extraction of supertags from (discontinuous) constituency treebanks and in parsing with LCFRS supertags. We start in Section 3.1 with a short motivation for our notation for supertags, which deviates from the previous formulation. Section 3.2 describes our process for the extraction of supertags and highlights the parts that require sets of parameters. These parameters are described in detail in Section 3.3. Section 3.4 concludes with an overview of the parsing process. All the described steps and concepts are illustrated along the arrows in Fig. 3; their labeling conforms with the numbering in the paragraph headings of Sections 3.2 and 3.4.

Supertags
Ruprecht and Mörbitz (2021) introduced LCFRS supertags as rules with certain annotations that do not fit into the usual framework but are necessary to convert derivations into constituent trees. Moreover, their notation is closely tied to the extraction and parsing pipeline, which, as explained in Section 1, assumes some fixed choices that we establish as hyperparameters. To tackle these limitations, we introduce the following notation for supertags: a supertag is a tuple (r, t, c, p) where
• r is an LCFRS rule whose terminal is a wildcard,
• t is None or an index in {1, 2} tracking the transformations for the extraction of lexical rules,
• c is either None or a constituent symbol, and
• p is a pos symbol.
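A minimal rendering of this tuple as a data type (field names and the example rule string are ours, not taken from the paper's implementation):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Supertag:
    rule: str                  # r: LCFRS rule with a wildcard "_" as terminal
    transport: Optional[int]   # t: None, 1 or 2
    constituent: Optional[str] # c: None or a constituent symbol
    pos: str                   # p: part-of-speech symbol

# a hypothetical supertag for a verb heading a discontinuous VP
tag = Supertag("S -> [x1 _ y1 x2](VP, NP)", 1, "S", "VBD")
```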

Extraction Process
The process described in this section is used to extract a sequence of supertags, one for each word in a sentence, from a given constituent tree. We distinguish four steps (i-iv), which are executed consecutively. The parameters for steps (i-iii) are described in detail in the next section.

(i) Binarization. We construct a binary constituent tree with the usual strategies in constituent parsing (Kallmeyer and Maier, 2013): each unary node is merged with its child (or pos symbol, if the child is a leaf), and nodes with arity n > 2 are split into n − 1 binary nodes according to the parameters described in Section 3.3. After this step, the constituent tree for a sentence w is equipped with |w| − 1 inner nodes. Figure 4 shows a binary tree resulting from the binarization of the tree in Fig. 1.
(ii) Guide. In this step, we define a guide for the binary constituent structure ξ, i.e. a mapping G between inner node positions and leaves in the constituent structure. In the following step, a lexical LCFRS rule will be constructed for the constituent at each inner node and the assigned leaf. Intuitively, the guide determines which sentence position is "responsible" for the constituent at each position. Formally, a guide for ξ is an injective function G : npos(ξ) → yd(ξ) such that, for each ρ ∈ npos(ξ), the assigned leaf G(ρ) is in yd(ξ|ρ). Figure 4 shows a guide for our example constituent structure, assigning a leaf (illustrated in gray circles) to each inner node. As G is injective and there is one fewer inner node than there are leaves in ξ, there is exactly one leaf that is not in the image of G. We will investigate multiple strategies, called guide constructors, to define guides for a given constituent tree, as discussed in Section 3.3.
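The two formal conditions on a guide (injectivity, and locality of the assigned leaf) can be checked mechanically. A small validator sketch under our own encoding: trees are nested tuples (label, [children]), leaves are plain integers (sentence positions), and a guide maps Gorn positions (tuples) of inner nodes to leaves.

```python
def yield_of(tree):
    """Set of leaves below a node."""
    if isinstance(tree, int):
        return {tree}
    _, children = tree
    return set().union(*(yield_of(c) for c in children))

def is_guide(tree, guide):
    inner = {}  # Gorn position (a tuple) -> subtree rooted there
    def collect(t, pos=()):
        if not isinstance(t, int):
            inner[pos] = t
            for i, c in enumerate(t[1]):
                collect(c, pos + (i,))
    collect(tree)
    injective = len(set(guide.values())) == len(guide)
    total = guide.keys() == inner.keys()  # one leaf per inner node
    local = all(guide[p] in yield_of(inner[p]) for p in inner)
    return injective and total and local

tree = ("S", [("NP", [0, 1]), 2])
print(is_guide(tree, {(): 2, (0,): 0}))  # True
```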
(iii) Lexical rule induction. In this step, we construct a lexical LCFRS rule r′ and the components c, t and p of the supertag for each leaf in the constituent structure. The nonterminals in the rule are determined by a chosen hyperparameter NT, called the nonterminal constructor, in terms of the constituent symbol and the guide G as described in Section 3.3. We give examples for (the root position in) the tree in Fig. 4, the guide values shown in pentagons (shortest guide constructor) and constituent symbols as nonterminals (classic nonterminal constructor). We start with the leaf i′ that is not in the image of G and define a new nonterminal L-A using the fixed string "L-" and the nonterminal A produced by NT for the parent of i′. In that case c = t = None and p = pos(i′) (in our example, p = NN).
After that, we define the following for each position ρ ∈ npos(ξ) (from the bottom up) and its assigned leaf i = G(ρ):
• The LCFRS rule r′ is assembled in the usual manner from i as lexical symbol and variables for the spans formed by the leaves in each successor, except those leaves that are assigned by G to ρ's ancestors. NT produces the lhs nonterminal for ρ; the rhs nonterminals are the lhs nonterminals constructed for ρ's children. In our example, r is assembled from the lexical symbol 3; the left successor's leaves are {0 (x1), 4, 5 (x2)} and the right one's are {1, 2 (y1)}; hence r = S → [x1 y1 3 x2]( ⃗B), where ⃗B contains the lhs nonterminals constructed for the two children.
• t is None if the leaf G(ρ) is a direct child of ρ; otherwise it is the index among the children where G(ρ) is located. In our example, 3 is not a child of the root but lies in its first successor; therefore t = 1.
• p is the pos symbol at G(ρ) in pos. In our example, p = VBD.
(iv) Supertag extraction. The tuples constructed in the previous step closely resemble the supertags described in the previous subsection. We pull them from the constructed tree and order them according to the sentence position included in each lexical rule. Lastly, the sentence position in each rule is replaced by a wildcard symbol.

Extraction parameters
Apart from the constituent treebank, our extraction algorithm expects parameters for binarization, a guide constructor and a nonterminal constructor. The vanilla parameters coincide with the existing algorithm.

Binarization parameters. During binarization, we distinguish the factorization (the direction in which new nodes are introduced): from left to right (lr bin) or head-outward (ho bin, which mixes left- and right-branching nodes such that the node's lexical head is contained in the last one). Both strategies are extended by markovization. The width of the horizontal markovization window is denoted by h, that of the vertical one by v.
Guide constructors. We define guides for given constituent structures using the following strategies. Figures 4 and 5 show the leaf assigned to each inner node for each guide constructor in an example constituent structure.
• vanilla: The guide maps each node position either to the leftmost leaf that is a direct successor, or (if not available) to the leftmost leaf in the yield of its right successor. The assignment is determined for each node from top to bottom.
• strict: The guide maps each node position to the leftmost leaf in the yield of its right successor.
• modifier: The guide maps each position to its modifier's lexical head. This guide requires that the constituent structures are binarized head-outward, which guarantees that each inner node has exactly one modifier.
• least: The guide aims to map many positions to leaves that are direct children. The guide is determined for each position from the bottom to the top and selects the nearest (and leftmost, if ambiguous) leaf for each position.
• shortest: The guide aims to map many positions to leaves that are as near as possible. The guide is determined for each position from top to bottom and, similar to the least guide, selects the nearest (and leftmost, if ambiguous) leaf for each position. When searching for the nearest leaf, we exclude a subtree if a leaf in it was selected previously.
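The simplest of these strategies is easy to sketch. The following is our illustration of the strict constructor (every inner node is mapped to the leftmost leaf in the yield of its right successor), using binary trees encoded as (label, left, right) tuples with integer leaves; it is not the paper's implementation.

```python
def yd(tree):
    """Set of leaf positions below a node."""
    if isinstance(tree, int):
        return {tree}
    _, left, right = tree
    return yd(left) | yd(right)

def strict_guide(tree, pos=(), guide=None):
    """Map each inner node (by Gorn position) to min(yd(right child))."""
    if guide is None:
        guide = {}
    if not isinstance(tree, int):
        _, left, right = tree
        guide[pos] = min(yd(right))  # leftmost leaf of the right successor
        strict_guide(left, pos + (0,), guide)
        strict_guide(right, pos + (1,), guide)
    return guide

g = strict_guide(("S", ("NP", 0, 1), ("VP", 2, 3)))
print(g)  # {(): 2, (0,): 1, (1,): 3}
```

Note that leaf 0 is not in the image of the constructed guide, matching the observation that exactly one leaf remains unassigned.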
Nonterminal constructors. The following strategies define the lhs nonterminal for a position ρ; here, L denotes the set of leaves assigned by the guide to ancestors of ρ.
• vanilla: The nonterminal consists of the symbol ξ(ρ), the fanout fo(yd(ξ|ρ)) as subscript, and, if L contains any leaf in yd(ξ|ρ), the difference in fanout fo(yd(ξ|ρ) \ L) − fo(yd(ξ|ρ)) as superscript. This superscript indicates the difference in fanout at ρ in the original constituent tree compared to the leaves assigned to the nodes in the subtree at ρ by the guide. (At most one leaf is assigned to an ancestor of ρ.) In our two examples, we construct the nonterminals SBAR+S and VP₂⁻¹.
• classic: The nonterminal consists of the first symbol in ξ(ρ) (including markers introduced during binarization) and the fanout fo(yd(ξ|ρ) \ L) as subscript. This constructor omits annotations that depend on the guide and is more akin to the usual strategies in LCFRS extraction (Maier and Søgaard, 2008). For our examples, the nonterminals are SBAR and VP.
• coarse: Like the classic nonterminals, but we replace the constituent symbols occurring in the treebank by their first letter. This is a very rough approximation of the nonterminals in coarse-to-fine parsing (Charniak et al., 2006) that does not need a specific clustering for each treebank. For our examples, the nonterminals are S and V.
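A hypothetical rendering of the classic and coarse constructors, where `leaves` stands for yd(ξ|ρ) and `assigned` plays the role of the set L of leaves assigned to ancestors; since we are restricted to plain strings, the fanout subscript is written after an underscore. Function names are ours.

```python
def fanout(leaves):
    """Number of maximal runs of consecutive positions."""
    xs = sorted(leaves)
    return sum(1 for i, p in enumerate(xs) if i == 0 or p != xs[i - 1] + 1)

def classic_nt(symbol, leaves, assigned):
    first = symbol.split("+")[0]  # first symbol, e.g. SBAR of SBAR+S
    return f"{first}_{fanout(leaves - assigned)}"

def coarse_nt(symbol, leaves, assigned):
    # like classic, but the constituent symbol shrinks to its first letter
    return f"{symbol[0]}_{fanout(leaves - assigned)}"

print(classic_nt("SBAR+S", {0, 1, 2, 3, 4, 5}, set()))  # SBAR_1
print(coarse_nt("VP", {0, 3, 4, 5}, {3}))               # V_2
```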

Parsing
For parsing, a small sample of supertags is predicted for each position in the input. Each tag is equipped with a score expressing its prediction confidence among all supertags for this position.
(v) Supertag derivations. For each predicted tag, the wildcard in the LCFRS rule is replaced by the input position it was predicted for. The sequence of input positions is parsed using the set of all lexical LCFRS rules inside the supertags. We equip each LCFRS rule with the prediction confidence of its supertag and use discodop (van Cranenburgh et al., 2016), an off-the-shelf statistical parser, to pick an LCFRS derivation that maximizes the product of prediction confidences. In this derivation, we replace each LCFRS rule by the supertag tuple it was taken from.
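Maximizing a product of confidences is commonly carried out by summing log-probabilities; the following sketch illustrates that equivalence (our illustration, not necessarily what discodop does internally).

```python
import math

def derivation_weight(confidences):
    """Log of the product of the supertag confidences in a derivation."""
    return sum(math.log(c) for c in confidences)

# confidences of the supertags used in two candidate derivations
candidates = [[0.9, 0.8], [0.95, 0.5]]
best = max(candidates, key=derivation_weight)
print(best)  # [0.9, 0.8]
```

Since the logarithm is monotone, the derivation with the highest log-weight is also the one with the highest product of confidences (0.72 versus 0.475 here).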
(vi) Transformation into constituent trees. The function listed in Algorithm 1 transforms the tree of supertags obtained in the previous step into a constituent structure for each node from top to bottom via recursive calls. Its first argument is the (sub-)tree to transform; the second one (i1) is either None or a leaf that is transported from the top to a lower position. The leaf for the current supertag (i2) is read from the lexical rule r (line 4). The list of children ⃗d in the supertag tree and the index t determine how the list of (usually) two children ⃗s is assembled (lines 6-12): each child in ⃗d will establish a child in ⃗s via a recursive call (line 9); if there is only one child in ⃗d, t determines which leaf in {i1, i2} is absorbed into ⃗s (lines 6 and 11); if there is no child, then both i1 and i2 are in ⃗s (line 11). Remaining leaves in {i1, i2} are transported into the children according to t (lines 6 and 9). If c is not None, the function adopts the constituent symbols (lines 13-15); otherwise the children in ⃗s are merged with their siblings (lines 16 and 10). The pos symbol p for the leaf i2 is stored in a set pos (line 5) and merged with the other pos symbols for each leaf (lines 9 and 15).

Experiments
Our experiments are conducted with the usual train/dev/test splits for the three discontinuous constituent treebanks DPTB, NeGra and Tiger.
For each treebank, we select parameters for the extraction using an incomplete grid search as described in Section 4.1. For the final models, we fine-tune a bert-base-cased and a bert-large-cased model (bert-base-german-cased and gbert-large, respectively, for NeGra and Tiger; Devlin et al., 2019; Chan et al., 2020) with the selected parameter configurations for 20 epochs and report the parsing scores and speed in Table 6. We use the selected parameters for fine-tuning the same classifiers (Devlin et al., 2019; cf. sec. A.2). Appendix A lists the details of the classifiers' training parameters and the used computing infrastructure. Results for models trained without any pretrained embeddings are shown in Appendix B.

For each treebank, we conduct a parameter search in four steps to select a satisfactory configuration.
(i) First, we investigate which components of the supertag tuples to predict in tandem (core supertags) and which ones separately for each position in a sentence.
(ii) We investigate a set of combinations for nonterminal and guide constructors and eliminate underperforming ones.
(iii) The previously found set of combinations is investigated for each binarization configuration and one final configuration is chosen.
(iv) Finally, we assess how many tags per sentence position (k) shall be predicted.
Each step is implemented as a grid search. For each configuration in a grid, we fine-tune a bert-base model for 5 epochs using the supertags extracted from the training set and evaluate using the dev set of the treebank. As an example, this process is documented in detail for the NeGra treebank in the following paragraphs.
(i) Core supertags. We investigate if there are advantages in predicting subsets of the supertag components jointly (core) while the others are determined independently. The core components always include the grammar rule r. During parsing, we use the top k predictions for the core components and only the best prediction for each other component. We ran this experiment with one parameter configuration (vanilla guide, classic nonterminals, lr bin with v = 1 and h = 0, k = 10) and propose that the results shown in Table 1 are clear enough to omit repeating this experiment with other configurations. They suggest that the prediction and parsing accuracies benefit from the absence of pos tags as well as from the presence of both other components. Without the pos tags, there are far fewer core supertag tuples and the quality of the pos tag prediction is significantly better. We continue all following experiments with the core (r, t, c).
(ii) Guide and nonterminal constructors. Here, we extract supertags for each combination of nonterminal and guide constructor. Binarization is fixed to lr (except for the modifier guide, which requires ho) with h = 0 and v = 1, and k = 10. In Table 2, we can clearly see that both parameters determine the size of the grammar; of course, that behavior was intended for the nonterminal constructors. Significantly fewer supertags are extracted using the strict and vanilla guides than with the three other guides. Table 3 shows that the strict guide takes a clear lead in prediction and parsing accuracy. We continue the search restricted to the strict guide and all three nonterminal constructors.
(iii) Binarization. We extract supertags using the following configurations for binarization: ho and lr, each with horizontal markovization context h ∈ {0, 1} and vertical markovization context v ∈ {0, 1, 2}. Table 4 shows the parsing scores for supertags extracted using these combinations. Markovization contexts h > 0 and v > 1 do not seem to give us an advantage in this setting; they are clearly disadvantageous with vanilla and classic nonterminals.

Table 4: Parsing scores (F1) in NeGra using supertags extracted with different configurations for binarization (rows distinguish lr and ho, columns distinguish values for h and v) and nonterminal constructors (rows).

The impact of greater contexts is significantly smaller with coarse nonterminals; however, they do not benefit from higher values either. We select the final configuration for NeGra via the highest F1-score in the table: classic nonterminals and lr bin with h = 0 and v = 1.
(iv) Predictions per position. After training the final models with bert-base, we pick a suitable value for k from the set {5, 10, 15, 25, 40}. From the results in Table 5, we conclude that there is only one case where a value k ≠ 10 shows an improvement in quality that justifies the resulting decrease in speed, namely k = 15 for parsing NeGra. For both other treebanks, we continue with k = 10.

Final configurations.
In the parameter search, we found the following configurations for our final models: (DPTB) strict guide and classic nonterminals with lr binarization where h = 0 and v = 2; (NeGra) strict guide and classic nonterminals with lr binarization where h = 0 and v = 1; (Tiger) strict guide and coarse nonterminals with lr binarization where h = 0 and v = 1. Each model is trained to predict pos symbols separately from the other supertag components. We use the top k = 10 (DPTB, Tiger) and k = 15 (NeGra) predicted supertags during parsing.

Conclusion
We generalized the extraction of supertags from treebanks by introducing parameters for previously fixed parts of the construction. The parameters allow us to control the part of the constituent tree that is associated with a terminal for each supertag (guide constructor) as well as the granularity of the grammar rules (nonterminal constructor). At the same time, the extraction process was re-ordered so that its description is less convoluted while retaining the same functionality.
Some introduced guide and nonterminal constructors performed better than the vanilla variants. Specifically, we observe the following: while the highly ambiguous grammars extracted from DPTB benefit from finer nonterminal granularity with a greater markovization window, the large and more specific grammars extracted from Tiger improve with coarser granularity; the grammar for NeGra lies somewhere in between.
Compared to the previous implementation of supertagging with LCFRS, we could improve the parsing quality across all three discontinuous treebanks. The improvements close the gap between the quality of parsing with LCFRS supertagging and the most recent discontinuous parsing strategies. In the case of the two German treebanks, we could even surpass them, most notably in terms of the F1-score for discontinuous constituents.

Limitations
In our experiments, we chose to heavily restrict the set of hyperparameters for the supertag extraction and the training of the neural model in order to finish them in feasible time. (We used fixed parameters for training and a step-wise incomplete grid search for the extraction.) Therefore, some interactions between parameters might still be concealed and optimal solutions yet to be found.
While we achieve state-of-the-art results with the two German treebanks NeGra and Tiger, our results fall behind the competition on the English DPTB. We could not find a conclusive reason for that, and it should be further investigated.
We clearly emphasize that our implementation is a prototype; especially the reported parsing speed in Table 6 should not be considered final for the approach. The two major reasons are that
• the statistical LCFRS parser that we used for step (v) in Section 3.4 is not optimized for lexical grammar rules and could possibly be improved upon, and
• flair, the framework that we use for training the neural classifiers, is easy to use in development but rather slow during execution.

In this approach, we extract supertags equipped with lexical grammar rules using an injective mapping between binary constituent tree nodes and sentence positions (called a guide, cf. step (iii) in Section 3.2). We suggest that it could be sensible to associate some sentence positions with multiple constituent tree nodes or no nodes at all. E.g., we found a guide that maps each node position to its lexical head intuitive, but multiple nodes may share the same lexical head while some leaves are not the lexical head of any node. Even though we decouple the constituent symbols from the grammar rules in our formulation, the formalisms that we use cannot keep track of the child relations in multiple constituent nodes per supertag.
Lastly, our approach inherits the problem of incomplete coverage from parsing with grammars. As Table 5 shows, there are still small numbers of sentences in all three used datasets that fail to parse in certain configurations. We chose to accept this as a trade-off for parsing speed. However, in settings where parse failures are critical, the extraction and parsing parameters should be selected very carefully.

B Supervised Classifier
We trained classifiers that only rely on the data in the training corpus (i.e. without pre-trained embeddings) using the same hyperparameters for the extraction that were found in Section 4.1. For that, we used the same supervised architecture for word encoding as Stanojević and Steedman (2020) and Corro (2020): each word is embedded into the concatenation of per-word vectors and the output of a character-level bi-LSTM. These embeddings are then fed into a sentence-level bi-LSTM. Finally, the score of each tag prediction is computed using a multi-layer perceptron.
pos: WRB PT NN VBD VBN RP
w: where the survey was carried out

Figure 3: (Top) Visualization of the extraction (left to right) and parsing (right to left) process. (Bottom) Examples for the steps shown in the top. The results after steps (i) and (ii) are shown in tandem: gray boxes next to the inner nodes show the leaves assigned by the guide. Dashed gray boxes show parameters involved in the extraction.

Figure 4: Constituent structure and pos symbols after binarization (v = 1, h = 0; ho and lr binarization coincide) of the tree in Fig. 1. The symbol "VP|<>" was introduced during binarization; former unary nodes were joined by "+". Gray integers next to inner nodes show the leaves assigned by a guide for the constituent structure.

Figure 5: Guides defined by the constructors introduced in Section 3.3. Gray integers show the leaf assigned to each inner node for the binary constituent structure. Encircled leaves are not in the image of the guide. The guide defined by the shortest constructor is shown in Fig. 4.
Wojciech Skut, Thorsten Brants, Brigitte Krenn, and Hans Uszkoreit. 1998. A linguistically interpreted corpus of German newspaper text. In Proceedings of the ESSLLI Workshop on Recent Advances in Corpus Annotation, Saarbrücken, Germany.

Miloš Stanojević and Mark Steedman. 2020. Span-based LCFRS-2 parsing. In Proceedings of the 16th International Conference on Parsing Technologies and the IWPT 2020 Shared Task on Parsing into Enhanced Universal Dependencies, pages 111-121, Online. Association for Computational Linguistics.

Andreas van Cranenburgh, Remko Scha, and Rens Bod. 2016. Data-oriented parsing with discontinuous constituents and function tags. Journal of Language Modelling, 4(1):57-111.

Ashish Vaswani, Yonatan Bisk, Kenji Sagae, and Ryan Musa. 2016. Supertagging with LSTMs. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 232-237, San Diego, California. Association for Computational Linguistics.

Yannick Versley. 2016. Discontinuity (re)²-visited: A minimalist approach to pseudoprojective constituent parsing. In Proceedings of the Workshop on Discontinuous Structures in Natural Language Processing, pages 58-69, San Diego, California. Association for Computational Linguistics.

David Vilares and Carlos Gómez-Rodríguez. 2020. Discontinuous constituent parsing as sequence labeling. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2771-2785, Online. Association for Computational Linguistics.

The following table contains the values for the neural classifier's architecture and hyperparameters during training. All experiments were run on the same compute server with an Intel Xeon Silver 4114 (40 cores, 2.2 GHz), 256 GB RAM and an Nvidia GeForce RTX 2080.
The following table gives an overview of the used architecture and hyperparameters.