Effective Batching for Recurrent Neural Network Grammars

As a language model that integrates traditional symbolic operations and flexible neural representations, recurrent neural network grammars (RNNGs) have attracted great attention from both scientific and engineering perspectives. However, RNNGs are known to be hard to scale due to the difficulty of batched training. In this paper, we propose effective batching for RNNGs, in which every operation is computed in parallel with tensors across multiple sentences. Our PyTorch implementation effectively employs a GPU and achieves a 6x speedup compared to the existing C++ DyNet implementation with model-independent auto-batching. Moreover, our batched RNNG also accelerates inference, achieving 20-150x speedups for beam search depending on the beam size. Finally, we evaluate the syntactic generalization performance of the scaled RNNG against an LSTM baseline, based on large training data of 100M tokens from English Wikipedia and a broad-coverage targeted syntactic evaluation benchmark. Our RNNG implementation is available at https://github.com/aistairc/rnng-pytorch/.


Introduction
Neural language models have an excellent word prediction ability, which motivates researchers to develop several analysis methods for fine-grained evaluation, aiming at understanding which linguistic abilities the models have acquired during training (Linzen et al., 2016; Wilcox et al., 2018; Marvin and Linzen, 2018; Warstadt et al., 2020). So far, many efforts have been made to evaluate the syntactic performance of models, including the ability to resolve distant subject-verb number agreement in English. Since neural language models are the foundation of contemporary NLP systems, building a language model with robust, human-like sentence processing abilities is an important goal, especially toward a system with human-like syntactic generalization that does not rely on superficial cues specific to the training data (McCoy et al., 2019; Linzen, 2020).
Past work has revealed that while sequential and unstructured models, such as LSTM and Transformer language models (Hochreiter and Schmidhuber, 1997; Vaswani et al., 2017), can induce several interesting syntactic behaviors, there is also a notable advantage in explicitly modeling syntax with dedicated architectures (Wilcox et al., 2019). A representative of such models is the recurrent neural network grammar (RNNG; Dyer et al., 2016), a top-down, left-to-right generative model of a parse tree and sentence.
While these results may suggest that RNNGs are a better modeling choice for language, they unfortunately have a practical drawback in scalability due to their structure-sensitive computation (Kuncoro et al., 2019). Since the computational graphs of RNNGs depend on the tree structures of the sentences, training cannot be mini-batched easily. This is in contrast to LSTMs and Transformers, whose token-wise operations can be batched across sentences, allowing efficient computation on GPUs, which is key to data scalability. Although RNNGs are claimed to be a fascinating language model, in practice they still have not replaced unstructured, computationally favorable models like LSTMs.
In this paper, we directly address the data scalability issue of RNNGs by showing that most computations during training can be batched across sentences. At the computational core of RNNGs are stack LSTMs (Dyer et al., 2015). In past work, Ding and Koehn (2019) have already shown that stack LSTM update operations can be reduced to a tensor operation by implementing the stack as a single tensor with a predefined maximum stack depth. Our work builds on this idea, with a few additional techniques to bridge the gap between simple stack LSTMs and RNNGs. Importantly, we devise an efficient batching method for composition operations on an arbitrary number of stack items, which was unsolved in previous work.
The existing RNNG implementation is based on DyNet, which supports a mechanism called Autobatch that automatically finds mini-batch units in the independent computational graphs over multiple sentences via lazy computation. While this mechanism is model-independent and allows intuitive implementation, its utility rapidly plateaus as the batch size increases. By contrast, our method allows effective parallel computation, increasing the training speed almost linearly with the batch size.
In addition to this new batching mechanism for improved scalability, we also provide a new analysis of the role of a strong syntactic inductive bias for models that can access larger amounts of data. For syntactic generalization abilities, while prior work suggests that the model's inductive bias plays a more important role than data scale, it also reports that off-the-shelf large-scale LSTMs or Transformers (e.g., GPT-2 or JRNN) perform much better than scale-controlled LSTMs. Does an RNNG, which already works relatively well on a modest amount of data, still benefit from data scale to further strengthen its syntactic ability? We train a new RNNG on about 100M tokens from English Wikipedia and evaluate its syntactic performance on SyntaxGym test circuits, finding that the data scale generally brings further performance gains, while the model tends to lose some heuristics on surface patterns that LSTMs seem to find. Our results suggest that RNNGs' reliance on structure is strengthened with more data, motivating future research on developing better syntactic representations themselves as supervision to structured language models.
A related approach to our work adds a syntactic bias to sequential language models, such as LSTMs, via knowledge distillation from RNNGs (Kuncoro et al., 2019, 2020). While the motivations are similar, we provide a rather direct solution to the scalability issue of RNNGs, opening up the new possibility of using them directly as an alternative to LSTMs. From another perspective, our work can be complementary to this line, because knowledge distillation requires a teacher RNNG model, which is itself costly to obtain. For example, Kuncoro et al. (2020) trained an RNNG on a relatively large dataset of 3.6M sentences, roughly comparable to the training data we use. While the details are missing, they report that training takes three weeks on a GPU. In contrast, our models nearly converge within three days. This direct improvement in training time greatly expands the applicability of RNNGs, including as a teacher for sequential models, and in more direct uses in computational psycholinguistics and NLP applications such as syntactic neural machine translation (Eriguchi et al., 2017).

Recurrent neural network grammars
RNNGs are joint generative models of a sentence and its constituency tree. While RNN language models assign a probability to the next token, RNNGs assign a probability to the next action, by which the parse state (a stack LSTM) changes dynamically. In this work, we focus on the stack-only RNNG (Kuncoro et al., 2017), which has some resemblance to RNNs in that a single state vector h_t defines the next action probability a_t: at each step, h_t is obtained from the top element of the stack LSTM, which preserves the intermediate LSTM states up to h_t. As a preparation for our batched RNNGs (Section 3), we formalize how these stack LSTM states change with each action. An RNNG internally maintains two different stacks: S_h and S_e. S_h is a stack LSTM, keeping the LSTM hidden states h_0 ··· h_t. S_e keeps stack elements, each of which is a word embedding e_w, an open nonterminal embedding e_x, or a closed constituent embedding e_c obtained by a REDUCE action.
At each step, the number of candidate actions is |N| + 2 given the set of nonterminal symbols N. Each action changes S_h and S_e as follows:
• NT(x): Push the open nonterminal embedding e_x onto S_e, and push LSTM(top(S_h), e_x) onto S_h.
• GEN(w): Push the word embedding e_w onto S_e, and push LSTM(top(S_h), e_w) onto S_h.
• REDUCE: Pop elements from S_e down to and including the last open nonterminal, compose them into a closed constituent embedding e_c, pop the corresponding states from S_h, then push e_c onto S_e and LSTM(top(S_h), e_c) onto S_h.
By declaring the operations as above, we see that the main obstacles to mini-batching are twofold: (1) the stacks have variable lengths, which change at each step for each sentence; and, more crucially, (2) the internal operations of each action, especially of REDUCE, differ largely across action types.
As we describe next, issue (1) has been largely solved in previous work. For (2), our strategy is essentially not to join different action types, but to improve the efficiency of each action as much as possible after grouping by action type. We find that in practice this strategy works quite well (Section 5.2), allowing models to benefit effectively from a large batch size.

Batched stack LSTMs
Ding and Koehn (2019) propose a sentence-level batched training algorithm for a restricted class of stack LSTMs designed for unlabeled dependency parsing without composition operations (Dyer et al., 2015). More specifically, they deal with parsing models defined by the following two operations only:
• PUSH: Push LSTM(top(S_h), e_w) onto S_h, where e_w is the embedding of the next token.
• POP: Pop the top element from S_h.
At each step, the next action is either PUSH or POP for each sentence. This model still suffers from problem (1) above. However, they show that by changing the data structure of the stack, the next PUSH or POP can be performed across sentences in a batch. Given B sentences in a batch, let S_h^i be the stack for the i-th sentence. What we need is to access all top elements of S_h^i (i ∈ [0, ..., B−1]) jointly, and this is possible by summarizing all stacks into a single stack tensor, denoted S_h, for which S_h[i, p] denotes the p-th element (LSTM state) on the stack of the i-th sentence.
The core idea behind performing PUSH and POP jointly is to run the LSTM update for all stack top elements in a batch, but only advance the stack-top pointers for PUSH items. Given next actions a = [PUSH, PUSH, POP, ...] of length B, we get a vector op = [+1, +1, −1, ...], denoting whether each next stack pointer moves by +1 (PUSH) or −1 (POP). Keeping a stack-top pointer vector p_h, each step can be batched as the following two operations:

S_h[i, p_h[i] + 1] ← LSTM(S_h[i, p_h[i]], E_w[i]) for all i ∈ [0, ..., B−1],
p_h ← p_h + op,

in which E_w holds the next-token embeddings. Unfortunately, this batching relies on a strong assumption about the model: that one action (PUSH) subsumes all operations (the LSTM update and the pointer move by op). This is not the case for RNNGs, in which no action can be reduced to a subset of another, necessitating a different strategy.
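In PyTorch, this joint step can be sketched as follows (a minimal toy illustration, not our actual implementation; tensor names follow the text, while the sizes and the dummy bottom element are assumptions):

```python
import torch

torch.manual_seed(0)
B, D, H = 4, 8, 16  # batch size, stack depth bound, hidden size (toy values)

# One tensor holds every sentence's stack; p_h tracks each stack top.
S_h = torch.zeros(B, D, H)             # stacked LSTM hidden states
S_c = torch.zeros(B, D, H)             # matching cell states
p_h = torch.ones(B, dtype=torch.long)  # start above a dummy bottom element

cell = torch.nn.LSTMCell(H, H)

def batched_step(E_w, op):
    """E_w: (B, H) next-token embeddings; op: (B,) of +1 (PUSH) / -1 (POP)."""
    idx = torch.arange(B)
    h_top, c_top = S_h[idx, p_h], S_c[idx, p_h]   # gather all tops jointly
    h_new, c_new = cell(E_w, (h_top, c_top))
    # The LSTM update is written for every sentence; POP sentences move
    # their pointer backwards instead, so the written slot is never read.
    S_h[idx, p_h + 1] = h_new
    S_c[idx, p_h + 1] = c_new
    p_h.add_(op)

with torch.no_grad():
    batched_step(torch.randn(B, H), torch.tensor([+1, +1, -1, -1]))
```

Note the design choice mirrored from the text: the LSTM is evaluated unconditionally for all B stacks, and only the pointer vector distinguishes PUSH from POP.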

Batched RNNGs
Our batching algorithm for RNNGs is built on the following two observations: (a) For all a_t ∈ {NT, GEN, REDUCE}, the last step is common and corresponds to the PUSH operation of the stack LSTM above, with a newly created embedding among {e_x, e_w, e_c}. This final step can be batched if we obtain all new embeddings as a single tensor E_next (of size (B, |e|)).
(b) The main problem is then reduced to computing E_next efficiently. This is possible by separately filling E_next for each action type, using a few additional pointer vectors to keep track of each sentence's stack state.
Algorithm 1: One training step for a batched RNNG. Input: next-action vector a; index vectors for each action type: i_gen, i_nt, i_red.
To obtain E_next, for NT and GEN we just need to look up the embeddings of the next nonterminal symbols and words. Obtaining multiple e_c's at once for REDUCE requires additional effort. Assuming a stack tensor as in Ding and Koehn (2019), the elements above a retracted pointer will never be accessed again, so we can signify the removal of the top open nonterminals just by decrementing p_q, without updating q itself.
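The per-action filling of E_next can be sketched as follows (a simplified, hypothetical helper; the index-vector names follow Algorithm 1, but the numeric action encoding is our own assumption):

```python
import torch

def build_E_next(actions, word_emb, nt_emb, composed):
    """Fill E_next per action type using index vectors over the batch.

    actions:  (B,) action ids (0 = GEN, 1 = NT, 2 = REDUCE).
    word_emb, nt_emb, composed: (B, E) candidate embeddings; only the rows
    at the matching action positions are meaningful.
    """
    E_next = torch.empty_like(word_emb)
    i_gen = (actions == 0).nonzero(as_tuple=True)[0]
    i_nt = (actions == 1).nonzero(as_tuple=True)[0]
    i_red = (actions == 2).nonzero(as_tuple=True)[0]
    E_next[i_gen] = word_emb[i_gen]    # look up next-word embeddings
    E_next[i_nt] = nt_emb[i_nt]        # look up nonterminal embeddings
    E_next[i_red] = composed[i_red]    # e_c's from batched composition
    return E_next
```

Each action type is processed separately, but every assignment is still a batched tensor operation over all sentences taking that action.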
We need a few additional tensors to achieve fully batched stack tensor operations; Figure 1 shows an example.
• S_h: A tensor of size (B, D, L, H), where the stack LSTM has L layers with H hidden dimensions. This is the core of the batched stack LSTM.
• S_e: A tensor of size (B, D, |e|), corresponding to S_e in the non-batched model (Section 2.1).
• b: A B-dimensional vector keeping the next token index in each sentence.

Figure 1: Example batched stack configuration. x|d means that x is at depth d. For example, "(NP it)" in the first sentence is closed and so constitutes a single item on the stack. p_q points to the top positions of q, which are underlined.

The operations are mainly categorized into filling E_next for each action (in red), pointer updates according to the action definitions (lines 3, 5, 6, 10, and 12), and finally stack updates (lines 13 and 14), corresponding to the common final operation observed in (a). gather_children is a function that returns a tensor summarizing the reduced children's node embeddings. Since the number of reduced children differs across the batch, we implement it to return a padded tensor, using the gather function in PyTorch. Deviating from Ding and Koehn (2019), we perform each action separately, as indicated by the use of i_a. This can be seen as a deficiency of our algorithm; however, this separation is necessary for anything beyond very simple models, which are practically less attractive. Rather, our strategy can be applied to broader classes of structured neural models, including dependency parsing with composition, and we believe that our empirical success (Section 5) encourages further exploration of the presented strategy on various models.
How to set D? As in Ding and Koehn (2019), we need to specify the stack depth bound D for each batch. Increasing this value incurs more GPU memory. For training, we can precompute the minimum value for each sentence by simulating the oracle transitions beforehand. For inference, we fix D = 100, since we find that even for very long sentences (more than 150 words), the stack depth never exceeds 80 for English sentences.
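The per-sentence minimum depth can be precomputed by a simple simulation over the oracle action sequence; a sketch (the string-based action encoding is illustrative, not our actual oracle format):

```python
def min_stack_depth(actions):
    """Simulate oracle transitions to find the deepest stack configuration.

    actions: sequence of strings like "NT(NP)", "GEN", "REDUCE".
    Each NT/GEN pushes one stack element; REDUCE pops the children and the
    open nonterminal, then pushes the single composed constituent e_c.
    """
    depth = max_depth = 0
    open_sizes = []  # elements accumulated above each open nonterminal
    for a in actions:
        if a.startswith("NT"):
            depth += 1
            open_sizes.append(0)
        elif a == "GEN":
            depth += 1
            if open_sizes:
                open_sizes[-1] += 1
        else:  # REDUCE: pop children + open NT, push one composed e_c
            n_children = open_sizes.pop()
            depth -= n_children
            if open_sizes:
                open_sizes[-1] += 1
        max_depth = max(max_depth, depth)
    return max_depth
```

For example, the oracle for "(S (NP the dog) (VP barks))" peaks at depth 4 (S, NP, the, dog on the stack), so D = 4 suffices for that sentence.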
A note on extra memory with stack tensors At first sight, our approach seems to suffer from limited scalability due to the fixed stack tensors (S_h and S_e). The sizes of these tensors grow with the model size, implying that we may not be able to employ a large batch size for a large model. In practice, however, this extra memory is not a bottleneck in the total memory for training. This is because the main consumer of memory during training is rather the computational graph itself, which keeps all intermediate hidden states at each step. Our stack tensors can be seen as a "storage" that allows these intermediate values to be computed effectively with tensor operations. The extra memory for this storage is smaller than the total memory of the computational graph, because the former depends on D while the latter depends on the total action length A, and D ≪ A in general.

Other Improvements
Batched beam search For inference as a language model or as an incremental parser, RNNGs typically employ word-synchronous beam search (Stern et al., 2017), which, however, is known to be very slow (Crabbé et al., 2019), because it often requires large beam sizes, such as 100 or 1000, and its operations are not batched. As a by-product of our batched training, we succeed in implementing fully batched beam search for RNNGs, without any for-loops, by which we drastically improve the search speed (Section 5.3). This is possible by adding a "beam" dimension to all state tensors (S_h, q, etc.).
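The idea of adding a beam dimension can be sketched as follows (a toy illustration of one batched expand-and-prune step; the word-synchronization bookkeeping is omitted, and the tensor names are assumptions):

```python
import torch

B, K = 2, 3  # sentences in a batch, beam width (toy values)

# Every state tensor gains a beam dimension, e.g. S_h becomes (B, K, D, H);
# per-step operations then view (B, K, ...) as (B*K, ...) and run batched.
scores = torch.full((B, K), float("-inf"))
scores[:, 0] = 0.0  # initially one live hypothesis per sentence

def expand_and_prune(log_probs):
    """log_probs: (B, K, A) next-action log-probabilities.

    Returns the top-K successor scores plus, for each survivor, which beam
    item it extends (parent) and which action it took -- all without loops.
    """
    A = log_probs.size(-1)
    total = (scores.unsqueeze(-1) + log_probs).view(B, K * A)
    top_scores, flat_idx = total.topk(K, dim=-1)
    parent = flat_idx // A   # reindex state tensors by this, e.g.
    action = flat_idx % A    # S_h = S_h[torch.arange(B)[:, None], parent]
    return top_scores, parent, action
```

After pruning, every state tensor is gathered along the beam dimension by `parent`, so the whole search proceeds with pure tensor operations across all B×K hypotheses.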
Subwords Given an increased amount of training data, the vocabulary size naturally grows. To suppress this effect, using subwords (Sennrich et al., 2016) has become a standard technique. We thus incorporate subword modeling into our RNNGs and employ it for our largest experiment in Section 6. Kuncoro et al. (2020) recently incorporated subwords into RNNGs, in which each word is regarded as a new constituent with a WORD label, e.g., (WORD cu| r| ry). This means the model always needs to perform an additional NT(WORD) and REDUCE for each token, even unsegmented ones, e.g., (WORD I), greatly increasing the average action sequence length, which in turn affects the training time. In this work, we model subwords by the simpler method of just segmenting each token. For example, an NP looks like (NP Th| ai cu| r| ry). While Kuncoro et al. (2020) note that this simple modeling is less effective, our experiments suggest that it is a good enough strategy, considering the added computational cost of the NT(WORD) actions.

Footnote 5: Our preliminary experiment suggests that our RNNG implementation can be scaled to model and data sizes at least comparable to ELMo (Peters et al., 2018), a large-scale LSTM-based model, given a similar amount of computing resources. We examine the maximum allowable batch size for a model with 1,256 hidden dimensions, amounting to 94M parameters (comparable to ELMo's 93M), and find that the batch size can be increased to 256, with a maximum action size per batch of 16,000 (see Section 5.1), on a single V100 GPU (16GB). Transformer-level scalability (Devlin et al., 2019) would still be infeasible because the RNNG's parallelism is only at the sentence level, not the token level as in Transformers.
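Our leaf-only segmentation can be sketched as follows (a hypothetical helper; `segment` stands in for any subword tokenizer, e.g., a trained BPE model, which is an assumption rather than our exact code):

```python
def segment_leaves(tree, segment):
    """Replace each terminal in a bracketed tree with its subword pieces.

    tree: an S-expression string like "(NP Thai curry)".
    segment: a function mapping a word to a list of subword pieces,
             e.g. segment("curry") -> ["cu|", "rry"].
    """
    out = []
    for tok in tree.replace("(", " ( ").replace(")", " ) ").split():
        if tok in "()":
            out.append(tok)
        elif out and out[-1] == "(":
            out.append(tok)           # nonterminal label: keep as-is
        else:
            out.extend(segment(tok))  # terminal: expand into subwords
    # Re-join with minimal spacing to recover the bracketed form.
    return " ".join(out).replace("( ", "(").replace(" )", ")")
```

No extra NT/REDUCE actions are introduced: the subword pieces simply become sibling terminals under the original constituent.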

Evaluating Efficiency of Batching
The main focus of this section is a comparative evaluation of our PyTorch RNNG implementation against the existing DyNet implementation. We show that: (1) with a large batch size, training speed drastically improves and models tend to find better parameters (Section 5.2); and (2) our batched beam search hugely speeds up inference (Section 5.3).

Setting
While the Penn Treebank (PTB; Marcus et al., 1993) has often been used to train RNNGs (Wilcox et al., 2019), it is too small, so here we use the larger BLLIP corpus (Charniak et al., 2000), expecting that the effects of large batch sizes become clearer on this modestly sized dataset.
Preprocessing For preprocessing, we largely follow the prior setup that also trains an RNNG on this dataset. We partition the data according to its LG size, the largest training setting, amounting to 42 million tokens for training and 1,500 sentences for development. One difference is our handling of unknown tokens: we limit the vocabulary to the 50,000 most frequent word types in the training data, whereas the prior setup uses all word types that appear at least twice, which vastly increases the vocabulary size and hence the model size. Unknown tokens are created in the same way as the Berkeley parser's surface feature rules (Petrov et al., 2006). Parse trees are annotated in the same way as well; we run the Berkeley neural parser (Kitaev et al., 2019), a state-of-the-art constituency parser, to assign accurate parses.

Footnote 6: We provide a pilot study of this method in Appendix B. Using the BLLIP corpus and the Penn Treebank, we explore the relationship between a suitable number of subword units and model sizes. The main result is that large subword units are effective for larger models, and subword modeling almost always improves parsing accuracy.

Footnote 7: https://github.com/cpllab/rnng-incremental. This implementation supports word-synchronous beam search. The training part is not implemented to use DyNet Autobatch, so we modified it to enable that.

Model size and parameters
We experiment with the most common RNNG model size in the literature: 256 dimensions for the input and LSTM hidden layers, with 2-layer LSTMs (Dyer et al., 2016). The total number of parameters is about 15M. The hyperparameters are summarized in Appendix A.

Other settings We employ some additional techniques to improve the efficiency of our batching mechanism. First, before training, we group sentences by their number of gold actions so that the examples in each mini-batch have similar action lengths. Specifically, we first sort the sentences by action length, divide them into groups of 4,096 sentences, and then sample each batch from a single group. Second, we predefine a maximum value for the total number of actions across the sentences in a batch, which we set to 26,000. This is inspired by a similar mechanism in fairseq (Ott et al., 2019) for the maximum number of tokens. This means that the number of sentences in a batch is adjusted to be smaller than the batch size when the action sequences (or sentences) are long, so the given batch size is a maximum that is fully exploited only for shorter sentences, which are in practice dominant in the data. Every experiment is run on a single V100 GPU with 16GB memory. Unless otherwise noted, we run every experiment three times with different random seeds, reporting the average score with standard deviation.
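The grouping and capping described above can be sketched as follows (a simplified, hypothetical version of the batching logic; the constants follow the text):

```python
def make_batches(action_lens, batch_size=512, bucket=4096, max_actions=26000):
    """Group sentence indices into batches of similar oracle-action length.

    action_lens: list of oracle action-sequence lengths, one per sentence.
    Sentences are sorted by length, split into buckets of `bucket` sentences,
    and each batch is drawn from a single bucket, capped both by `batch_size`
    and by the total number of actions it may contain.
    """
    order = sorted(range(len(action_lens)), key=lambda i: action_lens[i])
    buckets = [order[i:i + bucket] for i in range(0, len(order), bucket)]
    batches = []
    for b in buckets:
        cur, cur_actions = [], 0
        for i in b:
            if cur and (len(cur) == batch_size
                        or cur_actions + action_lens[i] > max_actions):
                batches.append(cur)
                cur, cur_actions = [], 0
            cur.append(i)
            cur_actions += action_lens[i]
        if cur:
            batches.append(cur)
    return batches
```

In actual training the resulting batches would also be shuffled each epoch; the sketch only shows how the action cap shrinks batches of long sentences.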

Effects of batch sizes
Although our batched training involves action-specific operations (Section 3), to our surprise, the efficiency improvement of our RNNG with large batch sizes is almost linear up to 256 (Figure 2). The improvement is narrower at 512, though this is mainly due to the restriction on the maximum number of actions in a batch (Section 5.1), which reduces the actual batch size for longer inputs. DyNet's Autobatch is quite effective up to 16, running much faster than ours due to the speed of C++, but no further improvement is obtained, probably because of the increased overhead of finding mini-batch units themselves in a large computational graph.
Though this result clearly demonstrates the efficiency of our batching mechanism, it is only meaningful if a large batch size in fact leads to faster model convergence. This is the case, as shown in Figure 3, where we compare the total validation losses as a function of actual wall-clock time during training. The loss is calculated every 1,000 batches. The model with batch size 512 converges fastest and, importantly, to better parameters. This result suggests that we can safely benefit from a large batch size as long as memory permits. In the following experiments, we fix the batch size to 512. We also find that running beam search in half-precision (fp16) does not change its results at all.

Beam search speed improvement
As we discuss in Section 4, we have also improved the efficiency of word-synchronous beam search, a standard technique to calculate incremental prefix probabilities (Hale, 2001) and a parse tree for RNNGs. Now, we evaluate the impact of this improvement. For PyTorch, we run it on V100 GPU; for DyNet, we find that it runs faster on CPUs so we instead use CPUs (Intel Xeon 6148, 20 cores x2), with Intel MKL. DyNet beam search is still too slow with this environment so we limit the number of tested sentences to 300 from the BLLIP development set. For PyTorch, we try two different batch sizes {1, 10}, with a restriction on the number of tokens in a batch, similarly to the total action size in training (Section 5.1). We fix this value to 250, with which the model can safely parse with the largest beam size of 1000.
Word-synchronous beam search employs two types of beam widths, the action beam size (k) and the word beam size (k_w), along with a fast-tracked candidate size, denoted k_s (see Stern et al. 2017). k is most akin to the standard beam size. Table 1 summarizes the results when increasing k (the other sizes are given in the caption). The DyNet beam search becomes prohibitively slow when k ≥ 50. Strikingly, the increase in average runtime is more than linear in the beam size, especially for 10→50 and 50→100. The time increases, 0.5→11.3 (22.6x) and 11.3→48.6 (4.3x), are roughly quadratic in the increase of k (5x and 2x). This result is plausible because, in addition to the per-step cost, which depends on k, the length of the searched action sequence can also grow linearly with k. The naive DyNet implementation directly suffers from this computational cost.
Our batched beam search largely resolves this issue, and the average runtime now grows only gradually with k. We note that as a parser or a language model, this speed is still not very fast, considering that it runs on a GPU. For research purposes, however, including psycholinguistic assessments as done in Section 6, this improvement is significant, making experiments much easier even with large beam sizes. We still need to work on improving efficiency further, possibly by modifying learning methods to replace word-synchronous search (Stanojević and Steedman, 2020).

Footnote 9: For a sentence of length N, the runtime of beam search is O(k × N × M_w), where M_w denotes the maximum number of actions between two tokens (until the next SHIFT is chosen). The expected number of actions between two tokens (bounded by M_w) grows with k because, at each step, a large k increases the chance that non-shift beam items remain in the next beam; hence the runtime becomes quadratic in k in the worst case. We conjecture that this inefficiency is bounded at some k (see k = 200), though it is severe for smaller k.

Footnote 10: For smaller k, we can increase the batch size and the maximum number of tokens in a batch for further speedup.

Table 1: Word-synchronous beam search speed (average seconds per sentence) on the first 300 sentences of the BLLIP development set. B denotes the batch size. Word beam sizes (k_w) / fast-track sizes (k_s) are 10/1, 10/1, 10/1, 20/2, 40/4, and 100/10, respectively.

Syntactic Generalization Ability of Scaled RNNG
Finally, we evaluate the syntactic generalization abilities of the scaled RNNG. For this purpose, we adopt the previously used test circuits via SyntaxGym. Here, a test circuit is a collection of test suites; e.g., the "Long-Distance Dependencies" circuit contains a suite on a specific type of "filler-gap dependencies" as well as a suite on (pseudo) "cleft". For each example in a suite, a model succeeds if it assigns a higher likelihood at a grammatically critical position of the correct sentence. For example, given "The farmer near the clerks knows/*know many people." in the "Agreement" circuit, a model is correct if it assigns p(knows|h) > p(know|h). Note that for subword models, the total likelihoods over subwords (not averaged) are compared.
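The minimal-pair scoring can be sketched as follows (a hypothetical helper; `token_logprob` stands in for a model's conditional (sub)word log-probability function):

```python
def prefers_grammatical(token_logprob, prefix, good, bad, segment=None):
    """Return True if the model scores the grammatical continuation higher.

    token_logprob(prefix_tokens, token) -> log p(token | prefix).
    For subword models, each candidate word is segmented and its subword
    log-probabilities are summed (not averaged) before the comparison.
    """
    def score(word):
        pieces = segment(word) if segment else [word]
        ctx, total = list(prefix), 0.0
        for p in pieces:
            total += token_logprob(ctx, p)
            ctx.append(p)
        return total
    return score(good) > score(bad)
```

The summation (rather than averaging) matters: averaging would favor candidates that segment into fewer, higher-probability pieces.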
In the previous literature, RNNGs have been trained on BLLIP (42M tokens). Here, we train subword RNNGs on 100M tokens from English Wikipedia, to which we assign parse trees with the Berkeley neural parser. The model size is 35M with 30k subword units, following the experiment in Appendix B, which assesses the suitable number of subword units for different model sizes. We train this RNNG for three days (with three different seeds) and, at inference, fix the beam size k to 100 (k_w = 10, k_s = 1). We also train an RNNG on a subset of this data (42M tokens) to separate the effects of data size. Our LSTM baseline is the one used in Noji and Takamura (2020), shown to work better than GRNN (Gulordava et al., 2018), one of the previously used models. The main result on circuit-level accuracies is summarized in Figure 4. On the effects of data scale, we observe a consistent improvement from "RNNG (42M wiki)" to "RNNG (100M wiki)". This result suggests that this increase in training data is still beneficial for structured language models to strengthen their syntactic generalization abilities. On some circuits ("Agreement" and "Licensing"), only RNNG (100M) outperforms GPT-2 (Radford et al., 2019) on average.
Comparing LSTM (100M) and RNNG (100M), the RNNG generally outperforms the LSTM, with one exception on "Long-Distance Dependencies". To inspect this, we break this circuit down into suites (Figure 5), finding that the RNNG's deficiency is due to its poor performance on (pseudo) "cleft", including the following example:
(1) a. What he did was prepare the meal .
b. *What he ate was prepare the meal .
On the underlined tokens, models should assign a higher likelihood to (1a). Our LSTM performs nearly perfectly on these cases while our RNNGs perform badly. We conjecture that, given more data and/or parameters, RNNGs tend to strengthen their commitment to the provided syntactic supervision, and hence may lose some lexical heuristics which LSTMs can exploit from surface patterns (e.g., an association of did → prepare). In fact, the LSTM's ability on cleft is rather brittle, as shown by a huge drop on "cleft modifier", which includes cleft constructions with intervening modifiers.

Footnote 11: Our LSTM implementation is available at https://github.com/aistairc/lm syntax negative. We train this LSTM on our subword-segmented Wikipedia (30k units). The model size is adjusted so that the total number of parameters becomes 35M, the same as our RNNGs (3-layer LSTMs with 1,150 hidden and 450 input dimensions).
To rigorously handle these cases, models should notice that (1b) is a free relative clause and does not have an antecedent. However, the currently employed PTB annotation, which is limited to local structures, does not distinguish between these clause types, analyzing both as "(SBAR (WHNP What) (S (NP he) (VP did/ate)))", which our RNNGs predict correctly. We also notice that the previously reported RNNG (H20) performs rather similarly to our LSTM, while our RNNG (42M), trained on a comparable amount of data to H20, behaves more like our RNNG (100M), suggesting that the RNNG's poor performance on cleft is not just due to the data scale. One possible explanation for the discrepancy between H20 and our 42M RNNG is that our RNNG may be better optimized thanks to the improved training, or the difference may come from the sizes of the hidden layers (256 for H20 and 656 for ours).
This problem poses a new interesting challenge. While RNNGs have been compared to LSTMs several times, the provided syntactic structures are fixed and effects of different annotations (formalism, quantity, etc.) are not explored. For such investigation, the training cost of RNNGs has been a practical burden, but that problem largely goes away with the current study. We expect that our new implementation and batching strategy provide fruitful future research opportunities on structured neural language models.

Conclusion
The large computational cost of training structured neural language models has been a main practical burden for employing these models in applications and analyses. With a special focus on RNNGs, we have provided a direct solution to this problem by showing that effective batched training is in fact possible. In large-scale experiments with SyntaxGym test circuits, we found that data quantity further strengthens the syntactic generalization abilities of RNNGs, while annotation quality and quantity will also be of practical importance toward a language model with human-like, strong syntactic performance.

A Hyperparameters
We use the default parameter settings for the DyNet implementation. For our implementation, we use the Adam optimizer (Kingma and Ba, 2015), which we find to be superior, while SGD has been used for the DyNet implementation (Dyer et al., 2016; Wilcox et al., 2019). We set the learning rate and dropout rate to 0.001 and 0.1, respectively, which we find achieve lower validation losses robustly across different batch sizes.

B Effect of Number of Subword Units
We perform an experiment to understand the behavior of our simple subword modeling (Section 4). We use the BLLIP corpus as preprocessed in Section 5.1, except for the vocabulary setting. We compare the fixed-vocabulary models, which we train in the experiment of Section 5.2 (batch size 512), with several subword-vocabulary models. The hyperparameters are the same as for the fixed-vocabulary models.

Model sizes
We prepare two different model sizes, 15M and 35M, to see the interaction between the suitable number of subword units and the model size, adjusting the input and hidden dimensions so that the total number of parameters becomes comparable to these sizes. For the 15M-parameter models, the dimensions are 528 for 10k units, 432 for 20k, and 336 for 30k. For the 35M-parameter models, they are 864, 752, and 656, respectively. The number of LSTM layers is fixed to 2.

Results
We investigate (1) the effectiveness of our simple subword modeling itself, and (2) whether the optimal number of subword units depends on the model size. For (1), one way of evaluation is to compare the perplexities of subword models and fixed-vocabulary models (see Section 5.1). However, these are not directly comparable, because the fixed-vocabulary models replace many tokens with unknown tokens, which are easy to predict and make the comparison unfair.

Table 2: PTB development set parsing accuracy (F1) when changing the beam size, averaged over three models with different random seeds. V_fix is the vocabulary size of the fixed-vocabulary models, while V_sb is that of the subword models. The model trained only on the PTB training set is not directly comparable. Word beam k_w = k/10 and k_s = k/100.

Table 3: Perplexity on the BLLIP validation set for each setting described in Table 2, averaged over three models with different seeds. For V_sb, perplexity is not subword-level but token-level, computed by summing the subword likelihoods of each token.
We instead validate the effectiveness of our subword modeling not by language modeling but by parsing performance. Note that the text in the BLLIP corpus is from the Wall Street Journal, the same as the Penn Treebank (PTB). We thus expect the quality of the automatic parses in our training data to be high, allowing us to assume that a good model should parse the gold PTB data more accurately. We run beam search on the PTB development set (section 22) for each model; the results are summarized in Table 2. F1 scores consistently improve with subword modeling compared to the fixed-vocabulary setting. The effects of model size (15M vs. 35M) are negligible, suggesting that the upper-bound parsing performance with the current silver-quality data can be reached with smaller models.
For (2) above, while comparing perplexities across subword and fixed-vocabulary models is impossible, comparing different subword units is possible by casting the subword-level likelihoods to token-level likelihoods (Mielke, 2019). Table 3 summarizes these values, with the results of the fixed-vocabulary models as a reference. The effects of the number of subword units (V_sb) are clearer: for the 15M models, the optimal V_sb is 20k, while for the larger 35M models it is 30k. This suggests that more parameters are needed to exploit a larger subword vocabulary effectively.