StrAE: Autoencoding for Pre-Trained Embeddings using Explicit Structure

This work presents StrAE: a Structured Autoencoder framework that, through strict adherence to explicit structure and the use of a novel contrastive objective over tree-structured representations, enables effective learning of multi-level representations. Through comparison across different forms of structure, we verify that our results are directly attributable to the informativeness of the structure provided as input, and show that this is not the case for existing tree models. We then extend StrAE to allow the model to define its own compositions using a simple localised-merge algorithm. This variant, called Self-StrAE, outperforms baselines that do not involve explicit hierarchical compositions, and is comparable to models given informative structure (e.g. constituency parses). Our experiments are conducted in a data-constrained setting (circa 10M tokens) to help tease apart the contribution of the inductive bias to effective learning. However, we find that this framework is robust to scale, and when extended to a much larger dataset (circa 100M tokens), our model, with only 430 non-embedding parameters, performs comparably to a 6-layer RoBERTa many orders of magnitude larger in size. Our findings support the utility of incorporating explicit composition as an inductive bias for effective representation learning.


Introduction
Human understanding of natural language is generally attributed to the understanding of composition. The theory is that smaller constituents (words, tokens) recursively combine into larger constituents (phrases, sentences) in a hierarchically structured manner, and that it is our knowledge of said structure that drives understanding (Chomsky, 1956; Crain and Nakayama, 1987; de Marneffe et al., 2006; Pallier et al., 2011). Compositionality can help drive efficient and effective learning of semantics, as it presumes knowledge only of the immediate constituents of a phrase and the process of their composition. However, the models that have come to dominate NLP in recent years generally do not explicitly take this property into account.
Transformers (Vaswani et al., 2017) have the capacity to model hierarchical compositions through the self-attention mechanism, whereby tokens can come to represent varying degrees of the surrounding context. However, should such behaviour occur, it is incidental to the training process, as there is no strict requirement for tokens to represent higher-order objects, and tokens are never explicitly merged and compressed with one another. Whether Transformers are able to acquire knowledge of syntax, as understood from the linguistics perspective, remains unclear (Kim et al., 2020). However, there is evidence that these models do become increasingly tree-like with sufficient scale and training steps (Jawahar et al., 2019; Murty et al., 2022). This raises an interesting question: to what extent is this drive towards tree-likeness responsible for their representation-learning capability? Furthermore, evidence from Lake et al. (2017) suggests that incorporating appropriate inductive biases towards composition can help bridge the stark disparity between ML models and humans in the quantity of data required for learning, and allow computational models to generalise better. If it is indeed an inductive bias towards hierarchical composition that enables humans to acquire multi-level semantics efficiently, and NLP architectures generally aren't explicitly tasked with modelling such structure, what happens when they are?
To investigate this, we develop StrAE, a Structured AutoEncoder framework. It takes as input a tree structure that represents a hierarchical composition process, and uses it to directly specify the flow of information from leaves to root (encoder) and back from root to leaves (decoder). To disentangle the effect of structure from other confounding factors, we constrain StrAE to only have access to the information immediately provided to it by the input tree. This means that for a given node the model can only use information from the nodes immediately connected to it (its constituents): a property we refer to as faithfulness. Following this, we investigate which training objectives are best suited to enable structured multi-level representation learning, and present a novel application of the contrastive loss to hierarchical tree-like structure.
To investigate the utility of compositional structure for representation learning, we employ a data-constrained setting (≈10 million tokens) and evaluate on a series of tasks that measure semantics at both the sentence and word level. We compare the representations learned by StrAE to a series of baselines consisting of other tree models, which do not enforce faithfulness as rigidly as StrAE, and a series of baselines that do not use explicit structure at all. We also verify that our performance is attributable to the form of composition expressed by the tree structures by comparing results across a range of different input structures.
Finally, to investigate how useful a simple bias towards hierarchical composition is, we extend StrAE to allow it to define its own compositional hierarchy. This variant, called Self-StrAE, utilises the learned representations and the encoder to define its own "merge" sequence, which is then employed as the tree structure for the decoder.
Our results indicate that knowledge of structure is indeed beneficial, and that, surprisingly, even a simple bias for hierarchical composition leads to promising results. In light of these findings, we extend our experiments to the 100 million token range and analyse how well Self-StrAE performs in a significantly larger setting. Even more surprisingly, despite having only 430 non-embedding parameters, Self-StrAE is able to achieve performance comparable to a 6-layer RoBERTa (Liu et al., 2019) model with 3.95 million parameters.

Model
We develop a framework (StrAE) that processes a given sentence to generate multi-level embeddings by faithfully conforming to the given structure. Intuitively, it involves embedding the tokens of a sentence onto the leaves of the structure, composing these embeddings while traversing up the structure to generate the full-sentence embedding, and then traversing back down the structure while decomposing embeddings from parent to children, to finally recover the sentence itself: effectively a structured autoencoder. While this generates embeddings for multiple levels of the sentence as dictated by the structure, it also generates embeddings for a node in two directions: composing upwards and decomposing downwards. The composition embeddings represent the local context for a node, while the decomposition embeddings represent its full context given the sequence.
Denoting embeddings e_i ∈ R^{N×N}, we distinguish embeddings formed traversing upwards and downwards as ē_i and e̲_i respectively. Encodings for the tokens are denoted as the vertices w_i ∈ Δ^V of a V-simplex for vocabulary size V. Note that while the token encodings w_i are effectively one-hot, reconstructions ŵ_i can be any point on the simplex, interpreted as a distribution over vertices, and thus the vocabulary, with an argmax retrieving the appropriate token. We define the four core components of StrAE as an embedding function E_φ : Δ^V → R^{N×N}, a composition function C_θ : R^{N×N} × R^{N×N} → R^{N×N}, a decomposition function D_ψ : R^{N×N} → R^{N×N} × R^{N×N}, and an un-embedding function U_ω : R^{N×N} → Δ^V, where the subscripts on the functions denote learnable parameters. Table 1 shows the model components and their associated parameters. We assume square embeddings during composition and decomposition, to allow for independent channels to capture different aspects of meaning. Figure 1 shows how the model operates over a given sentence.
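As a minimal sketch of the four components (embedding, composition, decomposition, un-embedding) over square N×N embeddings, the linear parameterisations below are hypothetical stand-ins, not the paper's actual learned functions:

```python
import numpy as np

N, V = 4, 10  # embedding side length, vocabulary size
rng = np.random.default_rng(0)

# Hypothetical linear parameterisations: stand-ins for StrAE's learned
# composition/decomposition functions, which may differ in form.
E = rng.uniform(-0.1, 0.1, size=(V, N, N))    # token embedding matrix
W_comp = rng.normal(size=(2 * N, N)) * 0.1    # compose: two children -> parent
W_decomp = rng.normal(size=(N, 2 * N)) * 0.1  # decompose: parent -> two children

def embed(token_id):
    """Leaf embedding: a square (N, N) matrix with independent channels."""
    return E[token_id]

def compose(left, right):
    """C: R^{NxN} x R^{NxN} -> R^{NxN}, applied per channel (row)."""
    return np.concatenate([left, right], axis=-1) @ W_comp

def decompose(parent):
    """D: R^{NxN} -> R^{NxN} x R^{NxN}."""
    both = parent @ W_decomp
    return both[..., :N], both[..., N:]

def unembed(e):
    """U: R^{NxN} -> Delta^V, a point on the vocabulary simplex."""
    logits = np.einsum('vij,ij->v', E, e)
    p = np.exp(logits - logits.max())
    return p / p.sum()

left, right = embed(3), embed(7)
parent = compose(left, right)       # upward (composition) embedding
l_hat, r_hat = decompose(parent)    # downward (decomposition) embeddings
p = unembed(l_hat)                  # argmax over p retrieves a token
```

Note how the square embedding shape is preserved through composition and decomposition, so the same functions apply at every level of the tree.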
Figure 2: Contrastive objective over structure: corresponding node (green) is pulled closer, and other nodes (red) are pushed away.

Objectives
Given this autoencoding framework, a natural objective to employ is the cross entropy (CE), which simply measures at each leaf node how likely the target token w_i is given the reconstructed distribution ŵ_i over the vocabulary. Given sentence s_j = ⟨w_i⟩_{i=1}^{T_j}, this objective is formulated as

L_CE(s_j) = −Σ_{i=1}^{T_j} log ŵ_i^⊤ w_i.

However, the CE objective places fairly minimal constraints on the embeddings themselves; it only requires the leaf embeddings to be 'close enough' for retrieval to be successful. The upward and downward intermediate embeddings are wholly unconstrained, and these can end up being quite different. Given we are learning embeddings for the whole tree, we would like to define an objective over all levels, not just the leaves.
For this purpose, we turn to contrastive loss (Chen et al., 2020; Radford et al., 2021; Shi et al., 2020). Contrastive loss involves optimising the embeddings of target pairs so that they are similar to each other and dissimilar to all negative examples. To adapt its use for structured embeddings, we apply the objective so as to task the model with maximising the similarity between corresponding upwards and downwards embeddings (ē_i, e̲_i) while simultaneously minimising the similarity between all other embeddings. This has the additional attractive characteristic that it forces amortisation of the upwards embeddings ē_i, incorporating context from the sentences the word or phrase-segment might have occurred in, as the corresponding downward embedding e̲_i has full-sentence information in it. Figure 2 illustrates an example of this objective.
For a given batch of sentences s_j, we denote the total number of nodes (internal + leaves) in the associated structures as M. We construct a pairwise similarity matrix A ∈ R^{M×M} between the normalised upward embeddings ⟨ē_i⟩_{i=1}^{M} and normalised downward embeddings ⟨e̲_i⟩_{i=1}^{M}, using the cosine similarity metric (with appropriate flattening of embeddings). Denoting A_{i•}, A_{•j}, and A_{ij} the i-th row, j-th column, and (i, j)-th entry of a matrix respectively, we define

L_cont = −(1/2M) Σ_{i=1}^{M} ( log [σ_τ(A_{i•})]_i + log [σ_τ(A_{•i})]_i ),

where σ_τ(•) is the tempered softmax (temperature τ), normalising over the unspecified (•) dimension. Note that L_cont extends to the batch setting. Finally, we initialise our embedding matrix from a uniform distribution with the hyper-parameter r denoting its range, corresponding to the finding that contrastive loss seeks to promote uniformity and alignment, as established by Wang and Isola (2020).
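The contrastive objective over upward and downward node embeddings can be sketched as follows (a minimal numpy version with a symmetric tempered softmax over rows and columns; the exact reduction and batching in the paper may differ):

```python
import numpy as np

def contrastive_loss(up, down, tau=0.2):
    """Symmetric contrastive loss over all M tree nodes.

    up, down: arrays of shape (M, N, N) holding the upward and downward
    node embeddings. Corresponding pairs sit on the diagonal of the
    cosine-similarity matrix A.
    """
    M = up.shape[0]
    u = up.reshape(M, -1)                 # flatten square embeddings
    d = down.reshape(M, -1)
    u = u / np.linalg.norm(u, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    A = u @ d.T                           # (M, M) cosine similarities

    def softmax(x, axis):
        x = x / tau                       # tempered softmax
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    rows = softmax(A, axis=1)             # each up against all downs
    cols = softmax(A, axis=0)             # each down against all ups
    diag = np.arange(M)
    return -0.5 * (np.log(rows[diag, diag]) + np.log(cols[diag, diag])).mean()
```

The loss is minimised when each upward embedding is most similar to its own downward counterpart and dissimilar to every other node's embedding in the batch.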

Experimental Setup
We divide our experiments into two separate sections, but both share the same overall setup for pre-training data and evaluation. For pre-training data, we take a 500k sentence (≈10M tokens) subset of English Wikipedia and a 40k sentence development set. We restrict our pre-training data to this scale in order to test the efficiency hypothesis presented in the introduction. For evaluation, we assess on three categories of tasks: word-level semantics, sentence-level semantics, and sentence-pair classification, with each aiming to capture separate areas of semantic understanding. On the word level we use Simlex (Hill et al., 2015), Wordsim-S and Wordsim-R (Agirre et al., 2009). On the sentence level, we use three tasks from the STS suite (Agirre et al., 2012, 2016; Cer et al., 2017) and the SICK relatedness dataset (Marelli et al., 2014), and for sentence-pair classification we use RTE and MRPC, taken from the GLUE Benchmark (Wang et al., 2019a). Table 2 provides an overview. Note that a subset of the tasks distinguish between similarity and relatedness. Briefly, the former measures semantic similarity, as between "running" and "dancing", as both words act as verbs. Relatedness measures semantic relationships such as between "running" and "Nike", where the words often co-occur, but belong to different grammatical categories. Finally, we note that Simlex only measures similarity, at the exclusion of relatedness.
All tasks apart from the final two classification tasks are measured using the Spearman correlation between the cosine similarity of model embeddings for each pair and human judgements; classification tasks are measured using accuracy. To emphasise the impact of pre-training with structure, we do not fine-tune any models for the evaluation tasks, instead keeping them frozen. Where a classifier is required, we fine-tune only a task-specific classification head consisting of an FFN with a 512-dimensional hidden layer and an intermediate Tanh activation function. We choose this setup to match the GLUE Benchmark. For all experiments, we pre-train the models across 5 random seeds and present the averaged performance. Downstream classifiers are themselves also trained across 5 random seeds and the average reported. Additionally, for all experiments we set the embedding dimension to 100. Finally, for all models trained on the word level we filter the vocabulary to exclude words occurring fewer than two times in the data.
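The correlation-based evaluation described above can be sketched as follows (a simplified Spearman that ignores tied ranks; a full evaluation would typically use scipy.stats.spearmanr):

```python
import numpy as np

def rank(x):
    """Rank-transform a 1-D array (ignores tied ranks for simplicity)."""
    order = np.argsort(x)
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(len(x), dtype=float)
    return ranks

def spearman(a, b):
    """Spearman correlation = Pearson correlation of the ranks."""
    ra, rb = rank(a), rank(b)
    ra = ra - ra.mean()
    rb = rb - rb.mean()
    return float((ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb)))

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

Per task, the model's cosine similarities for each word or sentence pair are correlated against the human judgement scores; the Spearman coefficient is what gets reported.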

Comparing Tree Models
Here, we compare StrAE with two existing tree architectures: the IORNN (Le and Zuidema, 2014; Drozdov et al., 2019, 2020) and the Tree-LSTM (Tai et al., 2015). Both architectures take structure as input and are able to traverse the tree to learn representations. However, they differ in the constraints they impose on information flow. We conduct our experiments along two different axes: how well do the representations perform, and to what extent do the models discriminate between input structures?
To achieve these evaluations, we parse our pre-training set into three kinds of structure. The first are constituency trees extracted from CoreNLP (Manning et al., 2014) and binarised using NLTK (Bird et al., 2009). The two other kinds are purely right-branching and balanced binary trees, which we extract using standard algorithms. The resulting structures are then converted to DGL graphs (Wang et al., 2020). Hyper-parameter descriptions for each model can be found in Appendix A.

Baselines
The Inside-Outside Recursive Neural Network (IORNN) processes data hierarchically, working from the "inside" to the "outside". At each node in the tree, an IORNN maintains two vectors: the inside vector ē, which represents the local meaning of a given node, obtained by composing up the tree; and the outside vector e̲, which represents the context for the given node, obtained by decomposing down the tree. While superficially similar to StrAE, the models differ in an important aspect. For a given parent p and children c1, c2, the outside representations are:

StrAE: (e̲_{c1}, e̲_{c2}) = D_ψ(e̲_p)
IORNN: e̲_{c1} = f([e̲_p; ē_{c2}]), e̲_{c2} = f([e̲_p; ē_{c1}]),

where [•; •] denotes concatenation. In StrAE the outside representation solely depends on the parent, and therefore enforces a compression bottleneck at the root of the structure. No other information may be shared from the composition process, and all e̲ embeddings are created recursively based on the root. The outside vector for IORNN is derived from both the parent and the inside vector of a given child's sibling node. As the root has no siblings or parent nodes, its outside vector consists of a global bias parameter, intended to represent the context of the whole pre-training corpus. Consequently, IORNN does not enforce a compression bottleneck, and information flows between both the local compositional (bottom-up) and the global decompositional (top-down) contexts. Our second baseline, the Tree-LSTM, is a recursive variant of the LSTM (Hochreiter and Schmidhuber, 1997). The main difference is that inputs are processed recursively rather than recurrently. At each node, the inputs for the cell are the children's hidden and cell states. Here the flow of information differs from StrAE: while StrAE has to compress embeddings according to the order dictated by the input structure, the Tree-LSTM is able to selectively retain information from lower down in the tree via the cell state. Tree-LSTMs can be applied both bottom-up (Choi et al., 2017; Maillard et al., 2017; Tai et al., 2015) as encoders, and top-down as decoders (Dong and Lapata, 2016; Xiao et al., 2022). Consequently, they can be pre-trained in the same way as StrAE. Parameter counts for all models can be found in Table 5.
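The contrast in top-down information flow between the two models can be sketched as follows (hypothetical linear maps standing in for the actual decomposition functions; the real IORNN and StrAE parameterisations differ):

```python
import numpy as np

N = 4
rng = np.random.default_rng(1)
# Stand-in parameters, not the models' actual learned weights.
W_strae = rng.normal(size=(N, 2 * N)) * 0.1   # parent -> both children
W_iornn = rng.normal(size=(2 * N, N)) * 0.1   # [parent out; sibling in] -> child out

def strae_outside(parent_down):
    """StrAE: children's downward embeddings depend on the parent alone,
    so everything below the root passes through its compression bottleneck."""
    both = parent_down @ W_strae
    return both[..., :N], both[..., N:]

def iornn_outside(parent_out, sibling_in):
    """IORNN: a child's outside vector also sees its sibling's inside
    vector, bypassing the root bottleneck."""
    return np.concatenate([parent_out, sibling_in], axis=-1) @ W_iornn

parent_down = rng.normal(size=(N, N))
left_out, right_out = strae_outside(parent_down)      # parent only
sibling_in = rng.normal(size=(N, N))
child_out = iornn_outside(left_out, sibling_in)       # parent + sibling
```

The extra `sibling_in` argument is the whole difference: it gives IORNN a second, bottom-up route for information to reach the decoder.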

Performance on Constituency Parses
We first evaluate model performance purely on the constituency parse structure, as in theory this should be the most informative. Results can be found in Table 3. We train all models using both objectives for the sake of parity with StrAE. Our results demonstrate that while all models are able to capture word-level semantics to some degree, it is only StrAE coupled with the contrastive objective that is able to extend this to sentence-level semantics, though StrAE with the cross entropy objective still performs better than the other architectures in this regard. These results indicate that enforcing faithfulness is beneficial in learning multi-level representations. It also follows that the contrastive objective is beneficial for StrAE, as it is directly applied to all nodes, and therefore directly optimises the root's relations to other sentences. In the classification tasks, there appears to be little difference between architectures.
It is clear that the contrastive objective is only useful when the model imposes constraints on information flow, as neither baseline benefits from it. The objective asks the model to reconstruct each node embedding such that it is distinctive from all other embeddings in the batch. If the LSTM is retaining information in its cell state, this distinction becomes difficult to enforce. IORNN faces difficulties in two regards: the outside representation of the root is a global parameter, which makes distinction significantly more difficult, and information is shared between sibling nodes. We tested the effect of removing the sharing of information and still found little improvement, possibly because the degree of information sharing between inside and outside representations is so high that it renders the requirement for meaningful compression void.
Evidence of IORNN's reliance on information sharing from inside sibling nodes can be seen in its Simlex performance with the cross entropy objective. Simlex actively penalises capturing semantic relatedness, which in the case of IORNN is information that would be provided by the outside vector. Its poor performance indicates that IORNN does not use the parent outside representation significantly when reconstructing embeddings, which may explain the difficulties on the sentence-level tasks, as the model can simply try to predict a word given its immediate left or right neighbour.

Performance Across Input Structures
We also evaluate how dependent the performance of each model is on input structure. We compare the performance of each model with constituency parses, right-branching, and balanced binary trees as input. Plots of these results are in Fig. 3, and the full tables can be seen in Appendix D. StrAE is dependent on input structure for performance, particularly when using the contrastive objective, though this still holds true across all task areas even for cross entropy. On the other hand, the baseline models are not, though this varies in degree. IORNN trained with cross entropy actually performs best with right-branching structure, though this is skewed by word-level performance. Conversely, the LSTM appears to discriminate, but this is solely due to word-level results and does not extend to other areas of evaluation.

Summary: StrAE is tasked with taking a sequence and compressing it into a single representation through ordered merges as dictated by the input tree. Each non-leaf node acts as a compression bottleneck. As a result, the contribution of each input token to the final root representation is directly dependent on the merge order. This makes StrAE structure sensitive. When coupled with contrastive loss, StrAE must learn embeddings such that similarity is dependent on how sequences compose. This is because the objective is now over all nodes, and all non-leaf node embeddings are defined by merge order. Secondly, it requires that the merge order have some degree of consistency to enable reconstruction, something which purely right-branching and balanced binary trees do not provide. It is the combination of strict compositional bottlenecking and the contrastive objective that enables StrAE to learn effective multi-level representations and serve as a probe for the utility of structure, unlike the tree-structured baselines.

Comparison to Unstructured Baselines
Here, we compare StrAE to a series of baselines that are not provided any form of parse tree as input.
The aim is to evaluate the usefulness of explicit structure against other inductive biases.We also introduce a variant of StrAE called Self-StrAE that is not provided structure as input but must learn its own mode of composition.It serves as a measure of the utility of an explicit compositional bottleneck.

Self-StrAE
Self-StrAE (Self-Structuring AutoEncoder) modifies the encoder so that it uses greedy agglomerative clustering to decide which tokens to compose. Unlike StrAE, where the order of compositions is defined by the input structure, in Self-StrAE it is dictated by the embeddings. Self-StrAE orders compositions according to the cosine similarity between adjacent node embeddings in the frontier, merging the argmax at each step. Fig. 4 shows this process. In the figure, the similarity between 'ate' and 'doughnuts' is greater than that of 'ate' to 'Homer', so the model first merges 'ate doughnuts' into a single embedding using the composition function. At the next step, the model merges 'Homer' and 'ate doughnuts' into a single embedding and arrives at the root. The algorithm is provided in Appendix B. We save the merge history in an adjacency matrix, so that by the time the encoder reaches the root node, the adjacency matrix represents the whole tree over the input sequence. We then convert the adjacency matrix into a DGL graph and pass that as input to the decoder. The decoder operates exactly as in vanilla StrAE, except that the input graph is defined by the encoder's merge history as opposed to, e.g., a syntactic parser.
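The localised-merge loop can be sketched as follows (a simplified version that records merge positions in a plain list rather than an adjacency matrix, and takes the composition function as an argument):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two (possibly square) embeddings."""
    u, v = u.ravel(), v.ravel()
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def greedy_merge(embeddings, compose):
    """Greedily merge the most similar adjacent pair until one root remains.

    embeddings: list of leaf embeddings, one per token.
    compose:    function mapping (left, right) -> parent embedding.
    Returns the root embedding and the frontier position merged at each step
    (the full model would record these merges as an adjacency matrix).
    """
    frontier = list(embeddings)
    merges = []
    while len(frontier) > 1:
        sims = [cosine(frontier[i], frontier[i + 1])
                for i in range(len(frontier) - 1)]
        i = int(np.argmax(sims))               # most similar adjacent pair
        parent = compose(frontier[i], frontier[i + 1])
        frontier[i:i + 2] = [parent]           # replace pair with parent
        merges.append(i)
    return frontier[0], merges

# Toy 'Homer ate doughnuts' run with 2-d embeddings and additive composition:
# the second and third leaves are most similar, so position 1 merges first.
leaves = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.0, 2.0])]
root, merges = greedy_merge(leaves, lambda l, r: l + r)
```

With these toy leaves the merge order is [1, 0]: the 'ate doughnuts' pair first, then the root, mirroring the figure's example.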

Baselines
We selected Fasttext (Bojanowski et al., 2016), a Bi-LSTM (Hochreiter and Schmidhuber, 1997) and RoBERTa (Liu et al., 2019) as baselines. Fasttext leverages distributional semantics and subword information to learn word embeddings, but only operates over a fixed window size. The Bi-LSTM allows us to measure the utility of sequential vs hierarchical information processing. Finally, RoBERTa serves as a suitable Transformer baseline, because it only utilises the MLM objective, rather than also using NSP as with BERT (Devlin et al., 2019), which has been shown to be more robust. Furthermore, it does not require additional data labelling, such as identifying which sentences follow each other, which provides greater parity with Self-StrAE and Fasttext, as neither model has such additional labels as input. For the Bi-LSTM and RoBERTa, we set the embedding dimensionality to 100 to match StrAE, and set the number of attention heads in RoBERTa to 10 in order to match StrAE's channels. We set the number of layers in RoBERTa to 6 because, in a data-constrained setting, there was no additional benefit observed with a greater number, and the parameter disparity between StrAE and RoBERTa is already substantial (see Table 5 for parameter counts for both the tree and unstructured baselines).
The other hyper-parameter details for all baselines can be found in Appendix A. To produce sentence embeddings, we take the mean of the word embeddings for Fasttext and the mean of the final-layer token representations for RoBERTa. To produce word embeddings for RoBERTa, we follow the lessons from Jawahar et al. (2019) and Vulic et al. (2020), which state that lexical information is contained in the lower layers. For cases where a word is broken into multiple subwords, we take the average of the embeddings from layers 0-2. Where a word is present in the vocabulary, we simply use its embedding, as there is no context to provide it with. In the case of the Bi-LSTM, we produce sentence embeddings by passing the concatenated final hidden states for both directions through a learned linear layer to produce a single 100-dimensional embedding. We found this to lead to improved performance compared to simply using the concatenation. The same strategy is used to produce word embeddings where a word is broken into multiple subwords. We contrast these models with StrAE trained on constituency parses with contrastive loss, Self-StrAE trained on the word level using the same vocabulary as before, and finally Self-StrAE starting from the subword level. We take the vocabulary from the same BPE (Sennrich et al., 2016) tokeniser used by RoBERTa (minus the special tokens), with a total size of 25000. We train Self-StrAE with the contrastive objective because this proved significantly more effective.

10 Million Tokens
As shown in Table 4, while StrAE performs best overall, Self-StrAE is able to achieve comparable performance across both the word and sentence level. Fasttext performs best on Simlex, but this is likely for the same reasons as IORNN, i.e., capturing similarity at the expense of relatedness. This is also indicated by its comparatively lower Wordsim-R performance. On both other lexical semantics tasks, StrAE and Self-StrAE outperform it. Fasttext also struggles on the sentence level. Similarly, the Bi-LSTM performs well on lexical semantics, but struggles to capture higher levels. This is evidenced both by the STS results and the lower lexical performance of the BPE Bi-LSTM compared with Self-StrAE. RoBERTa performs comparably to StrAE on STS, but struggles on the word level. This might simply be because Transformers aren't designed to learn static lexical embeddings, but could also be the result of our data-constrained pre-training setting, as there was significant variability between seeds. Consequently, we conducted a final experiment using a significantly larger pre-training set to assess the impact of scale.

100 Million Tokens

For this experiment, we turned to the WikiText-103 benchmark dataset (Merity et al., 2016). WikiText-103 consists of 103 million tokens, with an average sequence length of 118. Each input sequence corresponds to an article rather than a sentence, as in our original dataset. We set the maximum sequence length to 512 and train a subword tokeniser on the training set with a maximum vocabulary of 25000. This vocabulary is used for both RoBERTa and Self-StrAE. As shown in Table 6, under this setting, RoBERTa improves significantly on the word level. RoBERTa also shows improvement on the sentence level, though this is less pronounced, as the model was already performing well on these tasks. Surprisingly, Self-StrAE is able to achieve comparable (and in some cases better) performance than the RoBERTa model, despite having orders of magnitude fewer parameters. We can only attribute Self-StrAE's performance to the inductive bias behind it: that the model must perform hierarchical compositions of its input sequence, which we believe speaks strongly to its merits.

Summary
Comparison with the unstructured baselines shows that explicitly incorporating hierarchical compositions is beneficial for multi-level representation learning. With Self-StrAE, we show that these benefits do not necessitate an external parser for preprocessing, and can largely be achieved through an inductive bias for explicit merging.

Related Work
Recursive Neural Networks: StrAE belongs to the class of recursive neural networks (RvNNs), first popularised by Socher et al. (2011, 2013), who employed RvNNs to perform fine-grained sentiment analysis, utilising tree structure to overcome the deficits of bag-of-words models. These early successes inspired the creation of successor frameworks like the IORNN (Le and Zuidema, 2014) and Tree-LSTM (Tai et al., 2015) we use as baselines.
Learning Tree Structure: The induction of structure has long been a goal of NLP. Early pivotal work on corpus-based induction was performed by Klein and Manning (2004) and Petrov et al. (2006), and enabled the development of the tree-structured models that followed. A recent prominent approach is the C-PCFG (Kim et al., 2019).

Induction Through Representation Learning: DIORA and subsequently S-DIORA (Drozdov et al., 2019, 2020) are models which induce structure using the IORNN. Instead of providing a fixed tree as input, they train by using dynamic programming over the set of all possible trees, based on how well the IORNN is able to reconstruct the input sequence. At test time, they use CKY to extract the highest-scoring tree as the parse, which achieved SOTA on unsupervised constituency parsing. While these models do learn representations, they utilise them solely to enable unsupervised parsing.
Learning Task-Specific Tree Structures: Inspired by Socher et al. (2011), prior work has sought to learn trees for supervised tasks (Choi et al., 2017; Maillard et al., 2017; Yogatama et al., 2017), under the assumption that a particular task requires its own form of composition. They achieved success, but all use the Tree-LSTM, which was shown to be largely structure agnostic in the supervised setting (Shi et al., 2018); a finding we confirm in this work. In all cases, it was found that structure aided language modelling perplexity and generalisation.
Structure and Representation Learning: This area, the focus of our paper, remains largely underexplored. Prior work has solely examined the word level, using dependency parses to define the context neighbourhood for a given word. The first work to do this was Levy and Goldberg (2014), which achieved promising results; however, it faced issues with vocabulary size becoming intractable. Vashishth et al. (2019) alleviated this issue through the use of GCNs (Kipf and Welling, 2016).
Discussion and Future Work

We establish two findings. Firstly, defining representation similarity through composition is useful, and secondly, asking a model to arrange its own composition is a powerful inductive bias. Neither of these findings is limited in application to the architecture presented in this paper. The requirements are: an explicit merge operation and an objective that optimises representations across all levels. As long as these conditions are met, the findings are in theory applicable to any number of architectures. For future development, the natural next step is to examine what happens when we allow significantly more flexible models to dictate their own compositions. Transformers can be naturally adapted to incorporate such a bias, and we believe this holds considerable promise for future work, especially in light of recent findings that incorporating compression bottlenecks in Transformers is beneficial (Nawrot et al., 2021, 2023). Extensions need not be limited to the Transformer architecture, either. Recurrent and recursive neural networks share many similar features, and given the recent resurgence of RNNs (Orvieto et al., 2023; Peng et al., 2023), there may also be promise in extending research in this direction.

Limitations
The particular Self-StrAE presented in this paper is a considerably limited model. It is minimally parameterised (see Table 5), locally greedy, and makes uncontextualised decisions about which nodes to merge. The mode of composition it learns is certain to be suboptimal as a result, and it speaks to the strength of the inductive bias that it is able to perform at all. The structure it learns certainly doesn't resemble syntax as we understand it (example trees can be found in Appendix C), and neither did we expect it to. A significantly more flexible model would likely be required in this regard. Even then, it is possible that the form of composition a model learns with respect to its training objective may deviate substantially from our expectations of how compositional structure should look. Secondly, the aim of this paper is to conduct basic research into the benefits of composition, not to outperform the state of the art. We believe there is promise in pursuing further research that may eventually lead to improvements over SOTA, but we leave this to future work.

A Training Data and Hyper-parameters
We trained each StrAE model (and the tree baselines) for 15 epochs (sufficient for convergence) using the Adam optimizer with a learning rate of 1e-3 for cross-entropy and 1e-4 for the contrastive loss, and a batch size of 128. We applied dropout of 0.2 on the embeddings and 0.1 on the composition and decomposition functions. The temperature hyper-parameter τ for the contrastive objective was set to 0.2 and the r value to 0.1. These settings were obtained by a grid search over r ∈ {1.0, 0.1, 0.01}, batch size ∈ {128, 256, 512, 768}, τ ∈ {0.2, 0.4, 0.6, 0.8}, and learning rate ∈ {1e-3, 5e-4, 1e-4}. The Fasttext baseline was trained for 15 epochs with a learning rate of 1e-3 and a window size of 10. Self-StrAE was trained using the same hyperparameters as StrAE, except on Wikitext-103, where we lowered the learning rate to 5e-5, increased τ to 0.6, and decreased r to 0.0001. RoBERTa was trained for 100 epochs with a learning rate of 5e-5, a linear schedule, and 10% of steps used for warmup. We used relative key-query positional embeddings. The Bi-LSTM was trained for 15 epochs with learning rate 1e-3, batch size 128, and dropout of 0.2 applied to the embeddings and output layer.
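The locally greedy, uncontextualised merge decisions that Self-StrAE makes can be sketched as follows. This is a minimal illustration only: scoring adjacent pairs by cosine similarity and composing merged nodes by averaging are assumptions made for the sketch, not the paper's learned composition function, and the helper name `merge_order` is hypothetical.

```python
import numpy as np

def merge_order(embeddings):
    """Greedily merge the most similar adjacent pair until one node remains.

    Returns the sequence of merge positions, which implicitly defines a
    binary tree over the input. Cosine-similarity scoring and averaging
    composition are placeholder assumptions for illustration.
    """
    nodes = [np.asarray(e, dtype=float) for e in embeddings]
    order = []
    while len(nodes) > 1:
        # Cosine similarity of each adjacent pair (i, i+1).
        sims = [
            float(np.dot(nodes[i], nodes[i + 1])
                  / (np.linalg.norm(nodes[i]) * np.linalg.norm(nodes[i + 1]) + 1e-9))
            for i in range(len(nodes) - 1)
        ]
        i = int(np.argmax(sims))                 # locally greedy choice
        order.append(i)                          # record merge location
        merged = (nodes[i] + nodes[i + 1]) / 2   # placeholder composition
        nodes[i:i + 2] = [merged]
    return order
```

Because each decision looks only at one adjacent pair at a time, the procedure is O(n²) in sequence length and cannot revise earlier merges, which matches the "locally greedy" limitation discussed above.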

C Tree Statistics and Examples
Self-StrAE learns trees which are not purely right-branching, balanced binary, or purely random. We parse our development split using Self-StrAE BPE models pre-trained on the 10M corpus. The split itself consists of 40k sentences with an average length of 23.58. The trees from Self-StrAE have an average depth of 9, compared with 23 (rounding up for simplicity) for right-branching trees, and 5 for balanced binary trees. Self-StrAE trees exhibit a slight preference for right-branching: on average, 60% of non-leaf nodes have fewer left than right successors.
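The reference depths above follow directly from the average sentence length: a purely right-branching binary tree over n leaves nests n − 1 merges, while a balanced binary tree halves the span at every level. A quick check with n = 24 (the average length of 23.58, rounded up):

```python
import math

n = 24  # average sentence length (23.58, rounded up)

# A purely right-branching binary tree chains one merge per leaf after the first.
right_branching_depth = n - 1

# A balanced binary tree splits each span roughly in half at every level.
balanced_depth = math.ceil(math.log2(n))

print(right_branching_depth, balanced_depth)  # 23 5
```

Self-StrAE's average depth of 9 therefore sits between the two extremes, closer to balanced than to fully right-branching.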
However, the best way to get a sense of the kind of structures the model learns is by looking at the examples shown on the following pages. Looking at Fig. 5 and Fig. 6, the model has learned some sensible patterns. For example, it has learned to segment sentences with embedded clauses or conjunctions into their constituent parts (e.g. Fig. 5a,e,g and Fig. 6a). However, the trees frequently exhibit attachment errors that we hypothesise are the result of structure being determined by co-occurrence frequency. For example, in Fig. 6a the model merges [will be] as its own constituent rather than the correct parse of [will [be cancelled]], likely because "will" and "be" co-occur much more frequently than any instance of "be" + the passive form of a given verb. Similar behaviour can be found throughout the examples in Fig. 5 and Fig. 6. Given the simplicity of the model, it is unsurprising that Self-StrAE is unable to learn deeper rules; however, it would be interesting to determine to what extent this is also a function of the data it is trained on. Recent work has shown that transcribed child-directed speech leads transformer models to learn grammar more efficiently (Huebner et al., 2021; Mueller and Linzen, 2023). Transcribed CDS is far less regular than Wikipedia and may cause the model to avoid simpler heuristics like bigram frequency.
Finally, there are cases where the model seems to be learning totally implausible structures. We have yet to determine the root cause of this, but include examples in Fig. 7 for the sake of transparency.

D Performance by Structure Type
We show in Table 7 a comparison of performance for the different structure types and objectives used in this work. As discussed earlier, the objectives used here are contrastive (C) and standard cross-entropy (CE), and the structure types explored are syntactic, purely right-branching (RB), and balanced binary (BB) trees.

Figure 1: StrAE operation to encode and decode a sentence. Shared colours indicate shared parameters.

Figure 3: Average performance for models on different task areas by structure type; higher is better.

Table 1: Definitions of model components following § 2. Functions square and flatten transform between column vectors and square matrices. Functions hcat and hsplit perform horizontal concatenation and (middle) splitting respectively. σ(·) denotes the softmax function. ϕ and θ denote additive biases.

Table 2: Overview of evaluation tasks.

Table 3: Comparison of StrAE, IORNN and Tree-LSTM embedding performance on our evaluation suite. Higher is better. All tasks that use Spearman's ρ report the result ×100 and are marked with a †. Score represents the average across all tasks. All models were trained over five random seeds using constituency parses as input. The pre-training objective is indicated by C or CE, representing contrastive loss or cross-entropy respectively. Results are bolded only where there is no standard-deviation overlap between models.

Table 4: Comparison to unstructured baselines. Higher is better. All tasks that use Spearman's ρ report the result ×100 and are marked with a †. Score represents the average across all tasks. All models were trained over five random seeds. Results are bolded only where there is no standard-deviation overlap between models.

Table 5: Number of parameters for our models and baselines. Following convention, we exclude the embedding matrix (and LM head, if applicable) from the count.