Learning to Compose Representations of Different Encoder Layers towards Improving Compositional Generalization

Recent studies have shown that sequence-to-sequence (seq2seq) models struggle with compositional generalization (CG), i.e., the ability to systematically generalize to unseen compositions of seen components. There is mounting evidence that one reason hindering CG is that the representation of the encoder's uppermost layer is entangled, i.e., the syntactic and semantic representations of sequences are entangled. However, we argue that this previously identified representation entanglement problem is not comprehensive enough. Additionally, we hypothesize that the source keys and values representations passed into different decoder layers are also entangled. Starting from this intuition, we propose \textsc{CompoSition} (\textbf{Compo}se \textbf{S}yntactic and Semant\textbf{i}c Representa\textbf{tion}s), an extension to seq2seq models that learns to compose representations of different encoder layers dynamically for different tasks, motivated by recent studies showing that the bottom layers of the Transformer encoder contain more syntactic information while the top ones contain more semantic information. Specifically, we introduce a \textit{composed layer} between the encoder and decoder to compose different encoder layers' representations and generate specific keys and values for different decoder layers. \textsc{CompoSition} achieves competitive results on two comprehensive and realistic benchmarks, which empirically demonstrates the effectiveness of our proposal. Code is available at~\url{https://github.com/thinkaboutzero/COMPOSITION}.


Introduction
A crucial property of human language learning is compositional generalization (CG) -- the algebraic ability to understand and produce a potentially infinite number of novel combinations from known components (Fodor and Pylyshyn, 1988; Lake et al., 2017). For example, if a person knows "the doctor watches a movie" [医生看电影] and "the lawyer" [律师], then it is natural for the person to know that the translation of "the lawyer watches a movie" is [律师看电影] even though they have never seen it before. Such nature is beneficial for generalizing to new compositions of previously observed elements, which is often required in real-world scenarios. Figure 1 (Li et al., 2021) shows the workflow of how humans exhibit CG: suppose interpreters know the translations [丢失了狗] for "lost the dog" and [他喜欢] for "he liked" (semantic information); when they first encounter "lost the dog he liked", they can correctly translate it as [丢失了他喜欢的狗] instead of [丢失了狗他喜欢] by relying on Pattern 2.3 (syntactic information). Despite astonishing successes across a broad range of natural language understanding and generation tasks (Sutskever et al., 2014; Dong and Lapata, 2016; Vaswani et al., 2017), neural network models, in particular the very popular sequence-to-sequence (seq2seq) architecture, are argued to struggle to capture the compositional structure of human language (Lake and Baroni, 2018; Keysers et al., 2020; Li et al., 2021). A key reason for the failure on CG is that different semantic factors (e.g., lexical meaning and syntactic patterns) required by CG are entangled, which previous studies proved explicitly or implicitly to exist in the representation of the encoder's uppermost layer (encoder entanglement problem) (Li et al., 2019; Raunak et al., 2019; Russin et al., 2019; Liu et al., 2020b, 2021; Jiang and Bansal, 2021; Zheng and Lapata, 2022a; Yin et al., 2022; Ruis and Lake, 2022; Li et al., 2022; Cazzaro et al., 2023). In other words, the syntactic and semantic representations of sequences are entangled.
In order to alleviate the encoder entanglement problem, one line of research on CG mainly concentrates on improving the encoder representation or separating the learning of syntax and semantics, adopting approaches similar to humans' strategies for CG (see Figure 1). Specifically, several works either produce two separate syntactic and semantic representations and then compose them (Li et al., 2019; Russin et al., 2019; Jiang and Bansal, 2021), or design external modules and then employ a multi-stage generation process (Liu et al., 2020b, 2021; Ruis and Lake, 2022; Li et al., 2022; Cazzaro et al., 2023). Moreover, some studies explore bag-of-words pre-training (Raunak et al., 2019), newly decoded target context (Zheng and Lapata, 2022a,b) or prototypes of token representations over the training set (Yin et al., 2022) to improve the encoder representation. Furthermore, we hypothesize that the source keys and values representations passed into different decoder layers are also entangled (keys, values entanglement problem), not just the representation of the encoder's uppermost layer. We further illustrate this in Section 5.1.
Therefore, a natural question arises: how can we alleviate the keys, values entanglement problem? As a remedy, we examine CG from a new perspective, i.e., utilizing information from different encoder layers. We conduct a preliminary analysis (Appendix A) and conclude that the bottom layers of the Transformer encoder contain more syntactic information while the top ones contain more semantic information. Inspired by this, we collect the representations output by each encoder layer instead of separating the learning of syntax and semantics. An intuitive solution to the keys, values entanglement problem is thus to learn different, specific combinations of syntactic and semantic information (i.e., of the representations output by each encoder layer) for the keys and values of different decoder layers. We argue that an effective composition should provide different combinations for different tasks and a specific combination for a particular task. For example, the model can learn a preference over encoder layers at different levels for different tasks (e.g., for task A the information at encoder layer 0 may be more important, whereas for task B the information at encoder layer 5 may be more important). Additionally, for a particular task the model can select which encoder layer's information is most suitable (i.e., most important). Motivated by this, we propose a composed layer (learnable scalars or vectors) to generate different, specific source keys and values passed into different decoder layers for different tasks: the scalars or vectors learned by the model itself during training (i.e., different dynamic composition modes) can be adjusted dynamically for different tasks and provide a way to learn a preference over encoder layers at different levels for a particular task. Putting everything together, we propose COMPOSITION (Compose Syntactic and Semantic Representations), an extension to seq2seq models that learns to compose the syntactic and semantic representations of sequences dynamically for different tasks. COMPOSITION is simple yet effective, and applicable to most seq2seq models without any dataset- or task-specific modification.
Experimental results on CFQ (Keysers et al., 2020) (semantic parsing) and CoGnition (Li et al., 2021) (machine translation, MT) empirically show that our method improves generalization performance, outperforming competitive baselines and other techniques. Notably, COMPOSITION achieves 19.2% and 50.2% instance-level and aggregate-level CTERs on CoGnition (about 32% and 20% relative error reductions, respectively). Extensive analyses demonstrate that composing the syntactic and semantic representations of sequences dynamically for different tasks leads to better generalization results.
Encoder Layer Fusion. Encoder layer fusion (EncoderFusion) is a technique that fuses all encoder layers (instead of only the uppermost layer) for seq2seq models, and has been proven beneficial, e.g., layer attention (Bapna et al., 2018; Shen et al., 2018; Wang et al., 2019), layer aggregation (Dou et al., 2018; Wang et al., 2018; Dou et al., 2019), and layer-wise coordination (He et al., 2018; Liu et al., 2020a). However, other studies show that exploiting low-layer encoder representations fails to improve model performance (Domhan, 2018). The essence of the different EncoderFusion works is to explore different ways of combining information from different encoder layers. Our approach shares this essence, but proposes a new way to combine them. Meanwhile, there are three distinct differences. Firstly, our method exploits information from all encoder sub-layers and generates specific keys and values passed into different decoder layers, while they do not. Secondly, our method shows the effectiveness of utilizing low-layer encoder representations, whereas they hold the opposite view (see Appendix D). Thirdly, we do not share the same motivation or task: their work focuses on how to transform information across layers in deep neural network scenarios for seq2seq tasks, while our motivation is to compose the syntactic and semantic representations of sequences dynamically for CG.

Methodology
We adopt the Transformer architecture (Vaswani et al., 2017) to describe our method; however, the proposed method is applicable to most seq2seq models. In the following, we first introduce the Transformer baseline (Section 3.1), and then our proposed COMPOSITION (Section 3.2).

Transformer
The Transformer (Vaswani et al., 2017) is designed for sequence-to-sequence tasks and adopts an encoder-decoder architecture. The multi-layer encoder summarizes a source sequence into a contextualized representation, and another multi-layer decoder produces the target sequence conditioned on the encoded representation.
Formally, given a source sentence X = {x_1, ..., x_S} and a target sentence Y = {y_1, ..., y_T}, where S and T denote the numbers of source and target tokens respectively, let D = {(X, Y), ...} denote a training corpus, V the vocabulary of D, and θ the parameters of the Transformer model. The model aims to estimate the conditional probability:

$$p(Y \mid X; \theta) = \prod_{t=1}^{T} p(y_t \mid X, y_{<t}; \theta), \quad (1)$$

where t is the index of each time step, y_{<t} denotes a prefix of Y, and each factor p(y_t | X, y_{<t}; θ) is defined as a softmax distribution over V.
During training, the model is generally optimized with the cross-entropy (CE) loss, which is calculated as follows:

$$\mathcal{L}_{CE}(\theta) = -\sum_{t=1}^{T} \log p(y_t \mid X, y_{<t}; \theta). \quad (2)$$

During inference, the model predicts the probabilities of target tokens in an auto-regressive manner and generates target sentences using a heuristic search algorithm, such as beam search (Freitag and Al-Onaizan, 2017).
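For illustration, below is a minimal PyTorch-style sketch of this teacher-forced CE objective, assuming a generic encoder-decoder model that returns per-step vocabulary logits; the function and argument names are ours and not tied to any particular toolkit:

```python
import torch
import torch.nn.functional as F

def seq2seq_ce_loss(model, src_tokens, tgt_tokens, pad_idx):
    """Teacher-forced cross-entropy: -sum_t log p(y_t | X, y_<t)."""
    decoder_in = tgt_tokens[:, :-1]          # y_<t (BOS assumed to be prepended)
    gold = tgt_tokens[:, 1:]                 # gold next tokens y_t
    logits = model(src_tokens, decoder_in)   # (batch, T-1, |V|)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        gold.reshape(-1),
        ignore_index=pad_idx,                # skip padding positions
    )
    return loss
```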

COMPOSITION
Our proposed COMPOSITION extends the Transformer by introducing a composed layer between the encoder and decoder. Figure 2 shows the overall architecture of our approach.

Composed Layer
The composed layer is a list of 2N learnable weight vectors, one for each of the 2N source keys and values passed into the N decoder layers, where each weight vector consists of 2M learnable scalars (or vectors). M and N denote the numbers of encoder and decoder layers respectively, so 2M is the number of collected encoder sub-layer representations.
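As a minimal sketch (assuming scalar weights; the class and attribute names are hypothetical, not taken from the released code), the composed layer can be parameterized as follows:

```python
import torch
import torch.nn as nn

class ComposedLayer(nn.Module):
    """Holds 2N weight vectors (one per key/value of each decoder layer),
    each containing 2M learnable scalars over the collected encoder
    sub-layer representations (M layers x 2 sub-layers)."""

    def __init__(self, num_encoder_layers: int, num_decoder_layers: int):
        super().__init__()
        m, n = num_encoder_layers, num_decoder_layers
        # weights[j] weights the 2M collected representations for one of the
        # 2N targets (the keys and the values of each decoder layer).
        self.weights = nn.ParameterList(
            [nn.Parameter(torch.zeros(2 * m)) for _ in range(2 * n)]
        )
```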

Dynamic Combination
Here, we describe how to use the composed layer to compose the collected representations dynamically and generate the specific keys and values representations passed into different decoder layers. Let f_{Self-Attention} and f_{Feed-Forward} denote a Transformer self-attention sub-layer and feed-forward sub-layer respectively. The embedding layer of the Transformer encoder first maps X to embeddings H^0, which are then fed into a self-attention sub-layer and a feed-forward sub-layer to generate H^1_{SA} and H^1_{FF} ∈ R^{S×d} respectively, where d denotes the hidden size:

$$H^{1}_{SA} = f_{Self-Attention}(H^{0}), \quad (3)$$
$$H^{1}_{FF} = f_{Feed-Forward}(H^{1}_{SA}). \quad (4)$$

Next, each subsequent encoder layer takes the previous layer's output as input. The overall process is as follows:

$$H^{i}_{SA} = f_{Self-Attention}(H^{i-1}_{FF}), \quad (5)$$
$$H^{i}_{FF} = f_{Feed-Forward}(H^{i}_{SA}), \quad (6)$$

where 2 ≤ i ≤ M indexes the i-th encoder layer. Therefore, we can collect the representations output by each encoder sub-layer, H_c = [H^1_{SA}, H^1_{FF}, ..., H^M_{SA}, H^M_{FF}], with H^i_c denoting the i-th collected representation (1 ≤ i ≤ 2M). The keys and values of the multi-head cross-attention module of decoder layer l are defined as:

$$K^{l} = \sum_{i=1}^{2M} w^{i,l}_{k} H^{i}_{c}, \quad (7)$$
$$V^{l} = \sum_{i=1}^{2M} w^{i,l}_{v} H^{i}_{c}, \quad (8)$$

where w^{i,l}_k ∈ R and w^{i,l}_v ∈ R are learnable scalars or vectors that are mutually different (e.g., the weights used for the keys of one decoder layer differ from those used for its values or for other decoder layers), which weight each collected source representation in a dynamic linear manner. Eq. 7 and 8 provide a way to learn a preference over sub-layers at different levels of the encoder.
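To make Eq. 7 and 8 concrete, here is a minimal PyTorch-style sketch of the dynamic combination for one decoder layer, assuming scalar weights that are softmax-normalized before mixing (as suggested by the normalized weights visualized in Figure 4); all names are illustrative rather than taken from the released code:

```python
import torch

def compose_key_value(collected, w_k, w_v):
    """Build the source key/value for one decoder layer.

    collected: list of 2M encoder sub-layer outputs, each (batch, S, d)
    w_k, w_v:  learnable weight vectors of shape (2M,) for this decoder layer
    """
    stacked = torch.stack(collected, dim=0)          # (2M, batch, S, d)
    a_k = torch.softmax(w_k, dim=0)                  # normalize composition weights
    a_v = torch.softmax(w_v, dim=0)
    key = torch.einsum("m,mbsd->bsd", a_k, stacked)  # weighted sum over sub-layers
    value = torch.einsum("m,mbsd->bsd", a_v, stacked)
    return key, value

# Usage sketch: one (w_k, w_v) pair per decoder layer, e.g. taken from the
# ComposedLayer above, yields layer-specific cross-attention keys and values.
```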

Experiments
We mainly evaluate COMPOSITION on two comprehensive and realistic benchmarks for measuring CG, including CFQ (Keysers et al., 2020) and CoGnition (Li et al., 2021).

Experimental Settings
Datasets. CoGnition is a recently released realistic English → Chinese (En→Zh) translation dataset, which is used to systematically evaluate CG in MT scenarios. It consists of a training set of 196,246 sentence pairs, and a validation set and a test set of 10,000 samples each. In particular, it also has a dedicated synthetic test set (i.e., the CG-test set) consisting of 10,800 sentences containing novel compounds, so that the ratio of compounds that are correctly translated can be computed to directly evaluate the model's ability of CG. CFQ is automatically generated from a set of rules in a way that precisely tracks which rules (atoms) and rule combinations (compounds) each example contains. In this way, three splits can be generated with maximum compound divergence (MCD) while guaranteeing a small atom divergence between train and test sets, where a large compound divergence means the test set involves more examples with unseen syntactic structures. We evaluate our method on all three splits. Each split consists of a training set of 95,743 examples, and a validation set and a test set of 11,968 examples each. Figure 3 shows examples of both datasets.

Data Preprocessing. We follow the same settings as Li et al. (2021) and Keysers et al. (2020) to preprocess the CoGnition and CFQ datasets respectively. For CoGnition, we use an open-source Chinese tokenizer to preprocess Chinese and apply the Moses tokenizer to preprocess English, as in Lin et al. (2023) and Liu et al. (2023). We employ byte-pair encoding (BPE) (Sennrich et al., 2016) for Chinese with 3,000 merge operations, generating a vocabulary of 5,500 subwords. We do not apply BPE to English due to its small vocabulary (about 2,000 types). For CFQ, we use the GPT2-BPE tokenizer to preprocess source and target English text.

Setup. For CoGnition and CFQ, we follow the same experimental settings and configurations as Li et al. (2021) and Zheng and Lapata (2022a) respectively. We implement all comparison models and COMPOSITION with the open-source Fairseq toolkit (Ott et al., 2019). More details are provided in Appendix B.

Evaluation Metrics. For CoGnition, we use the compound translation error rate (CTER; Li et al., 2021) to measure the model's ability of CG. Specifically, instance-level CTER denotes the ratio of samples in which the novel compound is translated incorrectly, and aggregate-level CTER denotes the ratio of novel compounds that suffer at least one incorrect translation when aggregating all 5 contexts. To calculate CTER, Li et al. (2021) manually construct a dictionary of all the atoms based on the training set, since each atom has several valid translations. We also report character-level BLEU scores (Papineni et al., 2002) using SacreBLEU (Post, 2018) as a supplement. For CFQ, we use exact match accuracy to evaluate model performance, where natural language utterances are mapped to meaning representations.
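As an illustration of the two CTER variants, a minimal sketch of how they could be computed is given below, assuming a per-sample judgment of whether the novel compound was translated correctly is already available (e.g., via the atom dictionary) and that each compound appears in 5 contexts; the data layout is hypothetical:

```python
from collections import defaultdict

def cter(samples):
    """samples: list of (compound_id, compound_translated_correctly: bool)."""
    by_compound = defaultdict(list)
    for cid, correct in samples:
        by_compound[cid].append(correct)

    # Instance-level: fraction of samples whose novel compound is mistranslated.
    inst = sum(not ok for _, ok in samples) / len(samples)
    # Aggregate-level: fraction of compounds with >= 1 error across their contexts.
    aggr = sum(not all(oks) for oks in by_compound.values()) / len(by_compound)
    return inst, aggr

# Example: one compound seen in 5 contexts, mistranslated in 1 of them.
print(cter([("c1", True)] * 4 + [("c1", False)]))  # (0.2, 1.0)
```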

Model Settings
Machine Translation. We compare our method with previous competitive systems: (1) Transformer (Vaswani et al., 2017)

Results on CoGnition
The main results on CoGnition are shown in Table 1. We observe that: (1) COMPOSITION achieves 20.4% CTER Inst and 52.0% CTER Aggr, a significant improvement of 8.0% and 10.9%, respectively, over the Transformer. Moreover, COMPOSITION significantly outperforms most baseline models under almost the same parameter settings, indicating that composing the syntactic and semantic information of sequences dynamically for a particular task is beneficial to CG. Although Transformer+CReg achieves slightly better performance with fewer parameters, it is more complex and costly compared with COMPOSITION; (2) COMPOSITION, COMPOSITION-Rela, COMPOSITION-Small and COMPOSITION-Deep all deliver performance improvements, demonstrating the general effectiveness of our method; (3) COMPOSITION-Deep performs better than Bow, Dangle and Proto-Transformer, indicating that focusing only on alleviating the encoder entanglement problem achieves only part of the goals of CG, as mentioned in Section 1. Compared to SeqMix, the improvement of COMPOSITION is more significant (2.3% vs 10.9% CTER Aggr). SeqMix utilizes linear interpolation in the input embedding space to reduce representation sparsity, and we suppose that its randomly synthesized samples may be unreasonable and harmful to model training; (4) Transformer is even slightly better than DLCL, indicating that DLCL and COMPOSITION do not share the same motivation or scenario.

Results on CFQ
The main results on CFQ are presented in Table 2. We observe that: (1) RoBERTa is comparable to T5-11B and T5-11B-mod, and outperforms the other baseline systems without pre-training except HPD, indicating that pre-training indeed benefits CFQ; (2) COMPOSITION substantially boosts the performance of RoBERTa (43.4 → 59.4, about 37% relative improvement), and is in fact superior to T5-11B and T5-11B-mod. It also outperforms the other baseline systems without pre-training except HPD. This result demonstrates that pre-training as a solution to CG also has limitations, and indicates that COMPOSITION is complementary to pre-trained models; (3) HPD performs better than Dangle, RoBERTa+CReg and COMPOSITION, achieving 67.3 exact match accuracy, but it is highly optimized for the CFQ dataset. In contrast, COMPOSITION, RoBERTa+CReg and Dangle are generally applicable to any seq2seq model and any seq2seq task, including MT, as mentioned in Section 4.3. However, compared with the competitive performance on CoGnition, the improvement brought by COMPOSITION is relatively moderate, and even worse than Dangle. The underlying reason is related to a recent finding that compositionality in natural language is much more complex than rigid, arithmetic-like operations (Li et al., 2021; Zheng and Lapata, 2022a; Dankers et al., 2022). MT is paradigmatically close to the tasks typically considered for testing compositionality in natural language, and our approach is more suitable for dealing with such scenarios.

Analysis
In this section, we conduct in-depth analyses of COMPOSITION to provide a comprehensive understanding of the contribution of each individual component. For all experiments, we train COMPOSITION (6-6 encoder and decoder layers) on the CoGnition dataset, unless otherwise specified.

Effects of Specific Keys and Values of Different Decoder Layers
As mentioned in Sections 1 and 3.2, we hypothesize that the keys, values entanglement problem exists [...] the keys and values of different decoder layers utilize more than just the information of the encoder's topmost layer. More importantly, this also emphasizes that our method provides an effective composition of syntactic and semantic information, i.e., a specific combination for a particular task. To further demonstrate this, we also provide a toy experiment in Appendix C.

Effects of Composing Information of Encoder Layers or Sub-layers
As mentioned in Section 3, each Transformer encoder layer consists of two sub-layers. We assume that the sub-layers may contain linguistic information of different aspects, which may produce better generalization results. Therefore, we are curious whether composing different encoder layers' or sub-layers' information is more beneficial to CG.
In this experiment, we investigate its influence on CoGnition. Specifically, we train COMPOSITION to compose the representations output by either Eq. 5 or Eq. 6, or a combination of both, dynamically.
Results are presented in Table 4. We observe certain improvements (-6.2% and -5.8% CTER Inst) when separately composing SA-level and FF-level representations, where SA and FF denote the representations output by Eq. 5 and Eq. 6 respectively. Furthermore, combining both brings a further improvement (-8.0% CTER Inst), which illustrates that the information in different encoder sub-layers is complementary and yields cumulative gains. It also suggests that the syntactic and semantic information carried by SA and FF is similar but slightly different (Li et al., 2020), and that each can improve generalization performance on its own. It can be seen that the results of COMPOSITION-SA and COMPOSITION-FF in Table 4 are basically the same, and the improvement brought by combining them is relatively moderate.

Table 5: Example translations of Transformer vs COMPOSITION. The bold characters denote the novel compounds and corresponding translations.

Source: The waiter he liked wore each other's clothes.
Transformer: (He liked to wear each other's clothes.)
COMPOSITION: (The waiter he liked wore each other's clothes.)

Source: The waiter he liked came by and chased the bully off.
Transformer: 服务员来了，把那个恶霸赶走了。 (The waiter came by and chased the bully off.)
COMPOSITION: 他喜欢的服务员过来把那个恶霸赶走了。 (The waiter he liked came by and chased the bully off.)

Source: The waiter he liked picked up his mail.
Transformer: 服务员喜欢拿起他的邮件。 (The waiter liked to pick up his mail.)
COMPOSITION: 他喜欢的服务员拿起了他的邮件。 (The waiter he liked picked up his mail.)

Effects on Compositional Generalization
Compound Length and Context Length. Longer compounds have more complex semantic information and longer contexts are harder to comprehend, making them more difficult to generalize to (Li et al., 2021). We classify the test samples by compound length and context length, and calculate the CTER Inst. In Figure 5, we observe that COMPOSITION generalizes better than the Transformer as the compound and context grow longer. In particular, COMPOSITION gives a lower CTER by 11.0% over samples whose context length is longer than 13 tokens. This suggests that our approach better captures the compositional structure of human language.
Complex Modifier. The postpositive modifier atom (MOD) is used to enrich the information of its preceding word (e.g., he liked in the phrase lost the dog he liked), which is challenging to translate due to word reordering from English to Chinese. We divide the test samples into two groups according to whether the compounds contain (w/) or do not contain (wo/) MOD. In Figure 6, we observe that the advantage of COMPOSITION grows larger when translating compounds with MOD, demonstrating its superiority in processing complex semantic composition.
Case Study. We present 3 source examples containing the novel compound the waiter he liked (with MOD and 4 atoms) and their translations in Table 5. For all samples, a correct translation means the novel compound is translated correctly. COMPOSITION correctly translates the novel compound across different contexts for all samples, while Transformer suffers from omitting different atoms. For example, the translation of the waiter is omitted in the first example, he liked is omitted in the second example, and he is omitted in the third example. Our outputs not only contain the correct compound translations but also achieve better overall translation quality, while Transformer makes errors on unseen compositions, confirming the necessity of composing the syntactic and semantic representations of sequences dynamically.

Conclusion
In this paper, we examined CG from a new perspective, i.e., utilizing information from different encoder layers, and hypothesized that the source keys and values passed into different decoder layers are also entangled. We proposed COMPOSITION, which introduces a composed layer between the encoder and decoder to compose the syntactic and semantic representations of sequences dynamically for different tasks. Experimental results on CoGnition and CFQ demonstrate the effectiveness of our method.

Limitations
There are two limitations to our approach. Firstly, compared with the competitive performance on CoGnition, the improvements brought by COMPOSITION on CFQ are relatively moderate, and even worse than some competitive methods. Hence, COMPOSITION is more suitable for tasks typically considered for testing compositionality in natural language. We strongly recommend that researchers pay more attention to tasks evaluating compositionality on natural language. Meanwhile, we regard designing a more general method that improves generalization performance in both synthetic and natural scenarios as a promising direction to explore in the future. Secondly, our method is mostly applicable to seq2seq models that adopt an encoder-decoder architecture rather than encoder-only or decoder-only architectures. However, the methodology of COMPOSITION is still rather general for any seq2seq model with any architecture, since a randomly initialized encoder or decoder can be added to constitute an encoder-decoder architecture.

A Preliminary Analysis
In this section, we analyze the amount of syntactic and semantic information captured by different encoder layers of the Transformer in MT scenarios. We analyze the representations learned by different encoder layers by probing the encoder, i.e., using its output as the input representation for various prediction tasks, and measure the importance of the input features by evaluating the ability of a decoder trained on top of them. Specifically, we use a fixed encoder representation as input to two different tasks, Part-of-Speech (POS) tagging and semantic tagging, to evaluate the syntactic and semantic information contained in different encoder layers respectively. The underlying assumption is that if the input representation effectively captures a property (syntactic or semantic information), then the decoder can easily predict that property.
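The following is a minimal sketch of this probing protocol, assuming a trained MT model whose encoder exposes per-layer hidden states; a linear classifier stands in here for the tag decoder used in the analysis, and the encoder interface is an illustrative placeholder:

```python
import torch
import torch.nn as nn

def probe_layer(encoder, layer_idx, tagged_batches, num_tags, hidden_size, steps=1000):
    """Train a simple tagger on frozen representations from one encoder layer."""
    probe = nn.Linear(hidden_size, num_tags)         # lightweight prediction head
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    encoder.eval()
    for _, (tokens, tags) in zip(range(steps), tagged_batches):
        with torch.no_grad():                         # encoder stays frozen
            states = encoder(tokens, return_all_layers=True)[layer_idx]
        logits = probe(states)                        # (batch, seq_len, num_tags)
        loss = nn.functional.cross_entropy(
            logits.view(-1, num_tags), tags.view(-1), ignore_index=-100
        )
        opt.zero_grad()
        loss.backward()
        opt.step()
    return probe  # compare precision per layer to gauge syntactic/semantic content
```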
To explore the precise effects of the information captured by different encoder layers, we train the Transformer on the WMT18 English → Chinese (En-Zh, rich-resource) and English → Estonian (En-Et) translation tasks. We use annotated POS tagging data (syntactic task) and the annotated data from the Parallel Meaning Bank (PMB) (Abzianidze et al., 2017) for semantic tagging (semantic task). We use precision to evaluate model performance.
Results on POS tagging and semantic tagging are presented in Figures 7 and 8 respectively. We observe that:
• For POS tagging on both En-Et and En-Zh, performance tends to decrease as the number of layers increases.
• For semantic tagging on both En-Et and En-Zh, performance tends to increase as the number of layers increases.
These results indicate that the bottom layers of the Transformer encoder contain more syntactic information and the top ones contain more semantic information, and the information encoded by each encoder layer transforms from syntactic to semantic as the number of layers increases.

B Experimental Settings
For CoGnition, we set the hidden size to 512 and the feed-forward dimension to 1,024. The numbers of encoder and decoder layers are 6 and 6, and the number of attention heads is 4. The model parameters are optimized by Adam (Kingma and Ba, 2015), with β1 = 0.9, β2 = 0.98. The learning rate is set to 5e-4 and the number of warm-up steps is 4,000. We set the maximum number of tokens per batch to 8,192. We use one GeForce GTX 2080Ti for training with 100,000 steps and for decoding. We report the average performance over the 6 random seeds provided in Li et al. (2021). We train all COMPOSITION models from scratch.

For CFQ, we use the base RoBERTa with 12 encoder layers, combined with a Transformer decoder that has 2 decoder layers with hidden size 256 and feed-forward dimension 512. We use a separate target vocabulary. The number of attention heads is 8. The model parameters are optimized by Adam (Kingma and Ba, 2015), with β1 = 0.9, β2 = 0.98. The learning rate is set to 1e-4 and the number of warm-up steps is 4,000. We set the maximum number of tokens per batch to 4,096. We use one GeForce GTX 2080Ti for training with 45,000 steps and for decoding. We report the average performance over the 3 random seeds provided in Zheng and Lapata (2022a). We train COMPOSITION built on top of RoBERTa with full parameter fine-tuning.

C Effects of the Effective Composition
As mentioned in Section 3, we introduce a composed layer between the encoder and decoder to compose different encoder sub-layers' information dynamically and generate specific keys and values passed into different decoder layers. We are curious whether the composed layer can fuse all encoder sub-layers' information effectively. Therefore, we conduct a toy experiment on CoGnition. Specifically, all encoder sub-layers' information is accumulated to serve as the same key and value passed into every decoder layer (called Transformer-accu), rather than being composed dynamically as in our method. Results are listed in Table 6. Transformer-accu even fails to train. This suggests that even if the syntactic and semantic information of sequences is considered, inappropriate combinations instead bring noise that significantly hurts the model's CG performance.
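For reference, a minimal sketch of this uniform-accumulation baseline is shown below, contrasting with the learned, layer-specific weighting of Eq. 7 and 8 (names are illustrative):

```python
import torch

def accumulate_key_value(collected):
    """Transformer-accu: sum all 2M encoder sub-layer outputs into a single
    representation that serves as both key and value for every decoder layer
    (no learned, layer-specific weighting)."""
    accumulated = torch.stack(collected, dim=0).sum(dim=0)  # (batch, S, d)
    return accumulated, accumulated
```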

D Effects of Representations from Low-layer Encoder
To verify that low-layer encoder representations are also essential to our approach, we evaluate our approach on CoGnition using only the collected encoder representations of the top three layers. Results are presented in Table 7. We observe that composing only the representations of the top three encoder layers leads to a sharp drop in performance (27.0% vs 20.4% CTER Inst), although it still outperforms the Transformer baseline (27.0% vs 28.4% CTER Inst). This further demonstrates the distinct difference between our method and the findings introduced by previous studies on EncoderFusion. It also confirms our starting point, i.e., exploring how to compose syntactic and semantic information: COMPOSITION's performance is dramatically reduced when given only semantic information (the last three encoder layers' information).

E Reasons for Experiments on CoGnition without Language Models
We do not conduct experiments on CoGnition with language models for two reasons. First, CoGnition is constructed to test CG performance in MT scenarios with simple sentence pairs (see Figure 3), whereas language models are trained on vast amounts of multilingual sentences or bilingual sentence pairs. This is contrary to the compositional generalization task itself, since we cannot guarantee that every sentence in the test set is a novel combination of known components for language models. Second, it is unfair to compare large language models with systems without pre-training. We strongly recommend that researchers pay more attention to conducting experiments on CoGnition without language models.

Figure 2: Architecture of COMPOSITION based on the Transformer. The bright yellow block in the middle denotes the composed layer introduced in Section 3.2. The red line denotes that we collect representations of the same positions for the rest of the encoder layers.

Figure 4: Learned composition weights (after normalization) of each encoder layer (y-axis) attending to the keys or values of different decoder layers (x-axis).

Figure 5: CTER Inst of COMPOSITION and Transformer over different compound and context lengths.

Table 1: CTERs (%) on CoGnition. We report instance-level and aggregate-level CTERs on the CG-test set, separated by "/". In addition, we also report the commonly used BLEU score for MT tasks. "-" denotes that the results are not provided in the original paper. Results are averaged over 6 random runs.

Table 2: Exact-match accuracy on different MCD splits of CFQ. Results are averaged over 3 random runs.

Table 3: CTERs (%) when alleviating E or K,V on the CG-test set, where CTER Inst and CTER Aggr denote instance-level and aggregate-level CTER respectively. E and K,V denote the encoder and the keys, values entanglement problems respectively.

Table 4: CTERs (%) when composing different source information on the CG-test set.