Disentangling Reasoning Capabilities from Language Models with Compositional Reasoning Transformers

This paper presents ReasonFormer, a unified reasoning framework that mirrors the modular and compositional reasoning process of humans in complex decision-making. Inspired by dual-process theory in cognitive science, the representation module (automatic thinking) and the reasoning modules (controlled thinking) are decoupled to capture different levels of cognition. On top of the representation module, the pre-trained reasoning modules are modular and specialized in specific, fundamental reasoning skills (e.g., logic, simple QA). To mimic the controlled, compositional thinking process, different reasoning modules are dynamically activated and composed in both parallel and cascaded manners, controlling which reasoning skills are activated and how deep the reasoning process goes for the current problem. The unified reasoning framework solves multiple tasks with a single model and is trained and performs inference end-to-end. Evaluated on 11 datasets requiring different reasoning skills and complexity, ReasonFormer demonstrates substantial performance boosts, revealing compositional reasoning ability. Few-shot experiments exhibit better generalization, achieved by learning to compose pre-trained skills for new tasks with limited data and by decoupling the representation module from the reasoning modules. Further analysis shows the modularity of the reasoning modules, as different tasks activate distinct reasoning skills at different reasoning depths.


Introduction
Prevailing language models (LMs) (Devlin et al., 2018; Brown et al., 2020) demonstrate impressive performance on natural language processing tasks and have ushered in a new trend in AI research. Despite this fervor, homogeneous LMs that rely on a single call of the model are less modular and struggle to explicitly model the complex, human-like reasoning process (Helwe et al., 2021).
In dual-process theory (Daniel, 2017) from cognitive psychology, two cognitive systems interact to form the whole reasoning process. System 1 (automatic thinking) generates intuitive patterns of ideas, while System 2 (controlled thinking) constructs reasoning as an orderly, logical series of compositional reasoning steps. Moreover, within System 2, different functional brain areas can be modular and interact with each other. System 2 decides how to compose different reasoning skills and when to stop thinking. As the example in Fig. 1 shows, when finding the cause of a car accident, humans intuitively comprehend the question (System 1) and then conduct compositional reasoning (System 2: recalling a fact → logical deduction → answering the question).
We would like to incorporate this mechanism into AI models for decision-making, and make the following assumptions: (1) the representation module (System 1) and the reasoning modules (System 2) can be decoupled, and (2) the "complicated" reasoning process can be disentangled into multi-step executions of compositional "fundamental" reasoning modules, whose compositionality can be learnt with limited data. Also, the "fundamental" nature of basic reasoning skills means they have rich training instances for reliable skill pre-training.
Under these motivations, this paper proposes a modular and compositional reasoning framework, ReasonFormer, to mirror humans' compositional reasoning process, with the following characteristics: (1) the representation module and the reasoning modules are decoupled; (2) the reasoning modules are modular and specialized in fundamental reasoning skills; (3) the reasoning modules are compositional in parallel and cascaded manners, dynamically deciding the activated reasoning skills and the reasoning complexity; (4) the general-purpose reasoning framework is end-to-end and unified, solving multiple tasks with one model. Specifically, the representation module learns contextual representations of problems. On top of it, cascaded reasoning modules perform compositional multi-step reasoning. The reasoning modules are pre-trained to be experts in specific reasoning skills (e.g., logic, QA, facts). These pre-trained reasoning skills are considered relatively fundamental and have rich resources. Two additional blocks complete the framework: the reasoning router and the reasoning adapter. The reasoning router decides which reasoning skills are activated in each reasoning step and when to stop the reasoning process. The adapter adapts the reused reasoning modules to different steps of the reasoning process.
We comprehensively evaluate the framework on 11 datasets emphasizing different reasoning skills and complexity, and highlight the following findings: (1) Substantial performance boosts demonstrate the model's gain in compositional reasoning ability, and both reasoning-centric pre-training and the reasoning adapter bring compounding performance gains. (2) Few-shot experiments show that specialized modules enable better generalization by learning to compose pre-trained skills for low-resource tasks, and support decoupling the representation module from the reasoning modules. (3) Further analysis reveals the distinct reasoning skills required by different tasks at different reasoning depths, shoring up the modularity of the reasoning modules.

Reasoning Skills Formulation
The compositional reasoning process of LMs relies on pre-training several fundamental reasoning skills and on their compositionality. Hence, the selection of skills is critical.
Selection Principles. There are two major principles in selecting skills: (1) Fundamental: complex problems can be decomposed and solved with simpler basic skills, so the basic skills should be fundamental, well-defined, and covered in the required skill set of as many tasks as possible; (2) Resourceful: reliable skill pre-training requires large-scale pre-training data. However, in real-world scenarios, annotated data is expensive to obtain for most reasoning tasks, so rich resources should already exist, or data should be collectible in a self-(semi-)supervised manner.
Basic Skills Selection. Humans typically solve complex problems with fundamental skills, like understanding the key information (e.g., entities and their types) of events, recalling related facts, understanding causal relations between events, and extracting answers to questions. This motivates us to select the following basic skills: logic ability, to logically deduce the cause or consequence of events; simple question answering (QA), to understand the context and answer simple questions; named entity recognition (NER), to identify important entities in the context; natural language inference (NLI), to identify the semantic relevance of two sentences; and factual knowledge, to memorize commonsense knowledge and understand daily events. There is an additional general skill to learn the knowledge commonly shared across the selected skills. We keep this setting in our paper as these skills are relatively well-defined and resourceful.
We adopt self-supervised methods to construct the pre-training corpora for {logic ability, factual knowledge, NER}, a semi-supervised method to construct the pre-training corpus for simple QA, and large-scale supervised data for NLI.

ReasonFormer Framework
As shown in Fig. 2, the general-purpose reasoning framework is built on an encoder-decoder architecture to process multiple tasks (i.e., all pre-training tasks and downstream tasks) with a unified model, where all tasks are tackled as unified text-to-text generation tasks. We first reformat all the tasks into the same format using hard prompts (Sanh et al., 2021). For example, a question-answering input can be prompted with the template "The question is {Question}. Please give the answer:", and the expected output is the answer text.
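The reformatting step above can be sketched as follows. The template string mirrors the example in the text; the function name is illustrative, and the paper's full prompt set is listed in its Appendix A.

```python
def to_text2text(question: str) -> str:
    """Reformat a QA instance into the shared text-to-text format via a hard prompt."""
    return f"The question is {question}. Please give the answer:"

example = to_text2text("What caused the car accident?")
```

Every task, from NLI to multi-hop QA, is serialized this way, so one encoder-decoder model can be trained on all of them jointly.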
Given the prompted task inputs, the modular and compositional framework consists of two components in its encoder: the representation module (System 1) and the reasoning modules (System 2).
The representation module (§ 3.1) captures the intuitive understanding of problems by calculating initial contextual representations. On top of the representation module, there are several pre-trained reasoning modules (§ 3.2) with different reasoning skills, ready to interact and form a compositional reasoning process. To organize the reasoning process, reasoning routers (§ 3.2.2) decide the (parallel) activated skills and when to stop the (cascaded) reasoning process.

Representation Module
Similar to the perceptive function of System 1, the representation module targets basic contextual understanding and builds the foundation for the subsequent reasoning process. As LMs exhibit impressive contextual understanding, we build the representation module with cascaded Transformer layers. Given the tokenized input $X$ with length $m$, the initial representations learnt from the representation module are denoted as $\hat{H}_0 = \mathrm{Rep}([\mathrm{CLS}]; X)$, where [CLS] is a special token prepended to the input.

Reasoning Modules
To simulate the cognitive process (System 2) formed by the controlled interaction between various functional areas of the human brain, the reasoning modules are modular and compositional. Reasoning modules (RMs) learn different reasoning skills specified during pre-training and are automatically composed during downstream adaptation (§ 3.3) with the reasoning router (§ 3.2.2). Compositionality holds not only at the parallel level (different skills) but also at the cascaded level (multi-step reasoning). Since different reasoning steps intuitively model different levels of information, additional reasoning adapters adapt the reused modules to different reasoning steps.

Reasoning Modules Architecture
Each reasoning module is implemented as several Transformer layers. As shown in Fig. 2(b), reasoning modules with the same skill at different reasoning depths share parameters (excluding the reasoning adapters). For example, the Fact modules at steps {0, 1, ..., n} share their major parameters. The output of the last reasoning step is recursively taken as the input of the reused reasoning modules with step-specific adapters.
Reasoning Adapter. To adapt a reused reasoning module to different depths of the reasoning process, we add step-specific reasoning adapters to the reasoning modules. Inspired by Houlsby et al. (2019) on domain adaptation, as shown in Fig. 2, we add two reasoning adapters, following the multi-head attention layer and the FFN layer in each Transformer layer of the reasoning modules. The reasoning adapters are not shared across skills or reasoning depths.
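A Houlsby-style bottleneck adapter of the kind described above can be sketched as follows. This is a minimal numpy sketch, not the paper's implementation: the hidden size 768 matches T5-base and the bottleneck size 256 matches the implementation details in Appendix B, but the initialization scale and activation choice are assumptions.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

class ReasoningAdapter:
    """Bottleneck adapter: down-project, non-linearity, up-project, plus a
    residual connection. One instance per (skill, reasoning step) pair."""
    def __init__(self, d_model=768, d_bottleneck=256, seed=0):
        rng = np.random.default_rng(seed)
        self.W_down = rng.normal(0.0, 0.02, (d_model, d_bottleneck))
        self.W_up = rng.normal(0.0, 0.02, (d_bottleneck, d_model))

    def __call__(self, h):
        # h: (seq_len, d_model); the residual keeps the adapter near-identity early on
        return h + relu(h @ self.W_down) @ self.W_up

adapter = ReasoningAdapter()
out = adapter(np.zeros((4, 768)))
```

Because the adapters are tiny relative to the shared Transformer layers, each reasoning step gets step-specific behavior at little parameter cost.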

Reasoning Router
To compose the reasoning process, the reasoning router is critical in deciding which skills are activated per step and how many reasoning steps are required for problem-solving. As in the example in Fig. 1, problem-solving needs to recall facts, make logical deductions, and then answer questions. Therefore, the activated skills and reasoning depths may vary for every instance.
At the parallel level of each step, the skill router calculates activation scores for the reasoning modules. After each reasoning step, the stop gate decides whether the executed reasoning steps are sufficient for problem-solving through a stop gating mechanism.
Unlike Mixture-of-Experts (MoE) (Shazeer et al., 2017), which uses token-wise routing, we adopt an instance-level routing strategy, which captures more comprehensive semantics of the problem.
Skill Router. Since the $i^{th}$ reasoning step has $n$ reasoning modules $\{R_1, \cdots, R_n\}$ and a skill router $S_i$, the output $H_i$ of the $i^{th}$ reasoning step is calculated as the router-weighted average of the outputs of the $k$ activated reasoning modules:
$$H_i = \sum_{j=1}^{k} S_i(\hat{H}_{i-1})_j \cdot R_j(\hat{H}_{i-1}),$$
where $S_i(\hat{H}_{i-1})_j$ (a scalar weight) and $R_j(\hat{H}_{i-1})$ (updated hidden vectors) are the outputs of the router and the $j^{th}$ reasoning module, respectively. Since deciding the skills is a non-trivial task, we adopt a relatively complex router for deeper understanding: one Transformer layer $T$ projects the original output for routing-weight calculation, and an FFN layer followed by a Softmax function calculates the weighted scores:
$$S_i(\hat{H}_{i-1}) = \mathrm{Softmax}(\mathrm{FFN}(T(\hat{H}_{i-1}))).$$
Afterwards, we sparsely activate (Shazeer et al., 2017) the $k$ reasoning modules with the top-$k$ skill routing scores at each reasoning step. The router training objectives are detailed in § 3.3.
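One routed reasoning step can be sketched as follows. This is a simplified numpy sketch under stated assumptions: the instance-level summary is taken from the first ([CLS]) position, a single linear layer stands in for the Transformer+FFN router, and plain matrix multiplications stand in for the Transformer-based reasoning modules.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def skill_router_step(h_prev, modules, W_route, k=2):
    """One reasoning step: score all skill modules from an instance-level summary
    vector, sparsely keep the top-k, and return the routing-weighted sum of the
    activated modules' outputs."""
    summary = h_prev[0]                      # instance-level routing, not token-level
    scores = softmax(summary @ W_route)      # one score per skill module
    top_k = np.argsort(scores)[-k:]          # sparse activation of k modules
    h_next = sum(scores[j] * modules[j](h_prev) for j in top_k)
    return h_next, scores

rng = np.random.default_rng(0)
n_skills, d = 4, 8
# Stand-in "reasoning modules": one random linear map per skill
modules = [lambda h, W=rng.normal(size=(d, d)): h @ W for _ in range(n_skills)]
W_route = rng.normal(size=(d, n_skills))
h0 = rng.normal(size=(5, d))
h1, scores = skill_router_step(h0, modules, W_route, k=2)
```

Routing once per instance rather than per token is the key departure from MoE here: the whole problem is summarized before deciding which skills to run.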
Stop Gate. After each reasoning step, the stop gate decides whether the current reasoning depth is sufficient to solve the problem. Taking $H_i$ as the input, the stop gate uses a residual gating mechanism $G^i_{\mathrm{stop}}$ to control the information flow from the executed reasoning steps and calculates the final output $\hat{H}_i$ of the $i^{th}$ reasoning step by:
$$\hat{H}_i = G^i_{\mathrm{stop}}(H_i) \cdot H_i + \left(1 - G^i_{\mathrm{stop}}(H_i)\right) \cdot \hat{H}_{i-1}.$$
An FFN layer is used as the stop gate $G^i_{\mathrm{stop}}$. When the reasoning process is sufficient, the subsequent process is softly stopped by $G^i_{\mathrm{stop}}$.
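The residual stop gate can be sketched as below. This is an assumption-laden sketch, not the paper's exact parameterization: the gate value is computed from the summary position with a single sigmoid-activated linear layer standing in for the FFN.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def stop_gate(h_i, h_prev, W_gate):
    """Soft residual stop gate: an FFN-like layer produces a gate value g in (0, 1);
    as reasoning is judged sufficient, the output stays close to the previous
    step's representation instead of the new one."""
    g = sigmoid(h_i[0] @ W_gate)             # scalar gate from the summary position
    return g * h_i + (1.0 - g) * h_prev

rng = np.random.default_rng(0)
d = 8
h_prev = rng.normal(size=(5, d))
h_i = rng.normal(size=(5, d))
out = stop_gate(h_i, h_prev, rng.normal(size=(d,)))
```

Because the stop is soft rather than a hard halt, the whole cascade stays differentiable and trainable end-to-end.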

Pre-training and Adaptation
The unified model enables multi-task learning for both pre-training and downstream tasks. The major difference between pre-training and adaptation is that only in the pre-training stage do we have supervision for the activated skills.
Pre-training. Before reasoning pre-training, the model weights of ReasonFormer are initialized with pre-trained weights from T5 (Raffel et al., 2020). The details of model initialization and pre-training corpus collection are introduced in § 4.3.1 and § 4.2, respectively. Since the model knows which skill it is learning, we add a skill routing loss $L_r$, in addition to the teacher-forcing loss, to guide the routers in activating skills. For example, if the current instance focuses on logic ability, it should activate the {logic ability, general} skills. $L_r$ is set as the cross-entropy loss for multi-skill classification, where an activated skill has label 1 and the others have label 0. During pre-training, all the reasoning steps activate the same skill for a given instance.
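The routing supervision described above can be sketched as a binary cross-entropy over the per-skill routing scores. The skill ordering in the example is an assumption for illustration, and the exact formulation of $L_r$ in the paper may differ in detail.

```python
import numpy as np

def skill_routing_loss(routing_scores, active_skills):
    """Binary cross-entropy over per-skill routing scores: skills that should be
    activated get label 1, the rest label 0. A sketch of the auxiliary loss L_r."""
    labels = np.zeros_like(routing_scores)
    labels[active_skills] = 1.0
    p = np.clip(routing_scores, 1e-7, 1.0 - 1e-7)
    return -np.mean(labels * np.log(p) + (1.0 - labels) * np.log(1.0 - p))

# Assumed skill order: [logic, QA, NER, NLI, fact, general].
# A logic-focused instance should activate {logic, general}:
loss = skill_routing_loss(np.array([0.9, 0.05, 0.02, 0.01, 0.02, 0.8]),
                          active_skills=[0, 5])
```

The total pre-training objective is then the teacher-forcing generation loss plus this routing term.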
Adaptation. During downstream adaptation, we have no prior knowledge of the required skills for different tasks, so we expect the model to automatically learn which skills are essential for each specific task. Therefore, we adopt the standard teacher-forcing loss for generative training.
During evaluation, Hotpot QA adopts Exact Match (EM) as the metric, while the remaining tasks use accuracy. The answers for multi-choice QA and classification tasks are selected as the options with the highest log-likelihood scores.
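The option-scoring step can be sketched as follows. The per-token log-probabilities here are illustrative stand-ins for scores produced by the decoder, and summing (rather than, say, length-normalizing) the token scores is an assumption the paper does not pin down.

```python
def pick_answer(option_scores):
    """Select the option whose answer text has the highest total log-likelihood
    under the model; option_scores maps each option text to its per-token
    log-probabilities."""
    return max(option_scores, key=lambda opt: sum(option_scores[opt]))

scores = {"(A) rain": [-0.2, -0.4], "(B) fog": [-1.5, -2.0]}
answer = pick_answer(scores)
```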

Pre-training Corpus
To reduce the manual efforts in data collection, we mainly select self(semi)-supervised pre-training corpus construction methods.
To improve LMs' logic ability, we adopt the self-supervised logical pre-training corpus built by LogiGAN (Pi et al., 2022), which uses logical indicators to identify logical phenomena in a general text corpus. For QA-centric pre-training, we adopt the semi-supervised corpus-construction method of ProQA (Lewis et al., 2021; Zhong et al., 2022a), which uses a generation-filtering pipeline to build a QA-centric corpus. To help the model identify entities in text, we use the self-supervised NER corpus (Chen et al., 2022) built from Wikidata and Wikipedia anchor links. To learn factual knowledge, we use Wikidata as a commonsense knowledge base to construct a self-supervised pre-training corpus. Specifically, we sample 1 million fact triples from Wikidata and construct a KG-completion task (Moiseev et al., 2022) that recovers masked tail entities given the head entities and relations as inputs.
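The KG-completion instances can be constructed along these lines. The verbalization template and use of a T5-style sentinel token are assumptions for illustration; the paper's exact prompts are given in its Appendix A.

```python
def triple_to_instance(head, relation, tail):
    """Turn a Wikidata fact triple into a text-to-text KG-completion instance:
    the tail entity is masked in the input and becomes the generation target."""
    source = f"{head} {relation} <extra_id_0>."
    return source, tail

src, tgt = triple_to_instance("Marie Curie", "place of birth", "Warsaw")
```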
Furthermore, since the natural language inference task already has rich supervised data, we directly use the MNLI (Williams et al., 2018) and SNLI (Bowman et al., 2015) datasets as the pre-training corpus.
Finally, 1 million instances are collected for each reasoning skill, yielding 5 million pre-training instances in total for the 5 reasoning skills. Examples and prompts for constructing the inputs/outputs of the pre-training corpus are given in Appendix A.

Model Initialization
We adopt an encoder-decoder framework. In the encoder, the representation module has 9 Transformer layers, each shared reasoning module has 3 Transformer layers, and the maximum reasoning depth is 3. We initialize the major model parameters from pre-trained T5-base (Raffel et al., 2020): the representation module is initialized with the 1st–9th layers of the T5 encoder, and the reasoning modules are initialized with the 10th–12th layers. The decoder is the same as T5's.
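The layer-to-module mapping can be written out explicitly; the dictionary keys below are illustrative names, not identifiers from the paper's code.

```python
def t5_encoder_split(n_layers=12, n_rep=9):
    """Sketch of the initialization mapping: the first n_rep encoder layers of
    T5-base seed the representation module, and the remaining layers seed the
    shared 3-layer reasoning modules."""
    layers = list(range(n_layers))
    return {"representation": layers[:n_rep], "reasoning": layers[n_rep:]}

split = t5_encoder_split()
```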

Compared Methods
The major focuses of the experiments are to explore the effectiveness of ReasonFormer and to verify our hypotheses that complex problems can be disentangled and solved by compositional reasoning modules, and that the representation module and the reasoning modules can be decoupled. We compare ReasonFormer with two series of methods: the T5 series, and the MoRM series, i.e., Mixture-of-Reasoning-Modules methods inspired by Mixture-of-Experts (MoE) (Shazeer et al., 2017; Lepikhin et al., 2020a), whose construction is described with Table 1.

Experiment Analysis

Main Results
As presented in Table 1, ReasonFormer outperforms the T5 series and the MoRM series across all tasks, which emphasize a wide scope of different reasoning skills. We highlight the following findings. ReasonFormer > MoRM & T5: ReasonFormer surpasses the other methods (even those with more activated parameters) by a large margin, giving evidence for our primary hypothesis that the expertise of the reasoning modules and the cascaded compositional reasoning process essentially help the model solve complex reasoning problems.

RPT-T5 > T5:
The substantial performance boosts brought by RPT demonstrate that reasoning-targeted pre-training is essential for injecting various reasoning abilities into LMs.
Sparse vs. Full: Sparse activation of RMs leads to slightly reduced but comparable performance compared with full activation. This suggests that although activating more skills is beneficial, the most essential RM still plays the key role in problem-solving; the modularity of RMs can reduce the computation burden while maintaining performance. These positive findings show that ReasonFormer can model compositional reasoning, and they verify our primary hypotheses that complex problems can be decomposed and solved well with pre-trained basic skills, and that the representation module can be decoupled from the reasoning modules.

Ablation Study
We explore the truly functional components of ReasonFormer through ablation studies on 7 datasets. We evaluate the effectiveness of the following components: (1) reasoning pre-training; (2) the cascaded reasoning mechanism; (3) the expertise of reasoning skills (skill gating loss); and (4) the reasoning adapter.
Reasoning Pre-training. We assume the first factor contributing to the improvements is the multi-task reasoning-centric pre-training. Since vanilla LMs mainly focus on learning contextual semantics and do not emphasize higher-level reasoning ability (Pi et al., 2022; Helwe et al., 2021), it is intuitive that reasoning-driven pre-training can enhance the model in solving complex problems. Results in Table 2 show that ablating pre-training from all models leads to a substantial performance drop, demonstrating the importance of reasoning-centric pre-training in helping the reasoning modules learn fundamental skills.
Expertise of RMs. We assume that the modularity and expertise of the reasoning modules enable them to be flexibly composed. We ablate this from the Single version of ReasonFormer (Line 6) by pre-training all the RMs jointly, without the skill routing loss (§ 3.3), on the whole pre-training corpus.
The apparent performance drop suggests that the expertise of RMs enables the model to discriminate the functionality of various skills and selectively compose them to form a whole reasoning process.
Reasoning Adapter.The reasoning adapters adapt the shared RMs to different reasoning steps.
It is intuitively important, as different levels of cognition focus on information at different granularities. From Table 2, eliminating the reasoning adapter (Line 3) from ReasonFormer (Cascaded) harms the overall performance, attesting to the distinct mechanisms at different levels of reasoning and the importance of reasoning-centric adaptation.

Low-resource Experiments
It is interesting to know whether the fundamental skills of the RMs can be easily composed to solve new tasks with limited training data, and whether the representation module and the RMs can be decoupled during adaptation. If so, the model's generalization ability would be greatly enhanced through easy composition of pre-trained skills. Under these motivations, we conduct few-shot experiments. We first examine the generalization of ReasonFormer, and then examine the decoupling of its modules by freezing different modules during learning. We freeze the RMs (Line 3) to test whether the skills can be directly reused without further fine-tuning. Then we freeze the representation module (Line 4) to verify the decoupling of representation and RMs. From Table 3, we highlight the following findings.
(1) ReasonFormer outperforms T5, showing that the generalization ability of ReasonFormer is enhanced by reasoning pre-training and explicit modeling of the compositional reasoning process.
(2) Freezing RMs (Line 3) achieves comparable and even slightly better performance than its fully tuned version, demonstrating that the learned skills can be easily composed with limited training data without further tuning RMs.
(3) Freezing the representation module (Line 4) also leads to comparable performance, showing that the representation module and RMs can be decoupled during adaptation. This suggests that it is feasible to reduce the computation burden during few-shot adaptation by freezing the well-trained representation module and tuning only the RMs for new tasks, which is more efficient when the representation module (e.g., a gigantic LM) is too large to tune.
(4) Freezing both modules (Line 5) hurts performance, showing that model adaptation to data distribution of specific tasks is still essential.

Reasoning Skills Analysis
Qualitative analyses are conducted to explore how the pre-trained skills are composed to solve different reasoning tasks, and how the skills change at different reasoning depths. We calculate the skill routing weights at every reasoning step (up to 3) for three tasks (i.e., Commonsense QA, aNLI, and Hotpot QA). The case study provides examples and the corresponding (top-2) activated skills at each step.
As shown in Fig. 3, the activated skills vary across tasks and are dynamically composed to form a series of reasoning steps. Commonsense reasoning emphasizes {fact, QA}; the NLI task emphasizes {NER, NLI}; and the multi-hop QA task executes the QA module for multiple steps.
The statistical analysis of the averaged routing scores on the whole evaluation set demonstrates the same trend. These observations show improved interpretability of decision-making and give evidence to the hypothesis that the compositional cognitive process of humans can be transferred to AI models.

Related Works
Multi-step Reasoning. Multi-step reasoning is a characteristic of human thinking. Multi-hop reasoning (Yang et al., 2018b; Yu et al., 2021) asks the system to logically switch attention to different contexts (Zhong et al., 2022b) or make a multi-step deduction toward a new conclusion (Dalvi et al., 2019; Zhong et al., 2021). Recently, chain-of-thought prompting (Wei et al., 2022) provides the model with manual prompts describing intermediate reasoning steps. Creswell and Shanahan (2022) use LMs to iteratively select evidence and generate inferences. However, these approaches usually require discrete, manually written reasoning traces. Dohan et al. (2022) is a position paper raising interest in modeling such cascaded inference processes of LMs with a probabilistic program.
LM Modularity. Since human brains have various functional areas, it is inspiring to explore the modularity of LMs. Mixture-of-Experts (MoE) methods (Shazeer et al., 2017; Lepikhin et al., 2020b) use experts in FFN layers for sparse learning. However, their major motivation is to increase model capacity while maintaining efficiency, without emphasizing the speciality of the experts. Recent works begin to explore domain-specific experts (Gururangan et al., 2021) and modality-specific experts (Wang et al., 2021). SkillNet (Zhang et al., 2022) proposes skill-specific experts; however, its activated skills must be manually specified, and it does not explicitly model the cascaded reasoning process or the disentangling of perception and cognition. Considering these directions in the whole picture, this paper targets modeling the modular and compositional multi-step reasoning process of AI models in an end-to-end manner.

Conclusion
This paper simulates the compositional reasoning process of humans in decision-making and makes the following hypotheses: (1) the intuitive perception system (System 1) and the cognitive reasoning system (System 2) can be decoupled, and (2) complex decision-making can be disentangled into multi-step executions of fundamental reasoning skills. Correspondingly, we propose ReasonFormer, a compositional general-purpose reasoning framework. ReasonFormer decouples the representation module and the reasoning modules, which are pre-trained to be experts in fundamental reasoning skills. The reasoning modules are dynamically composed in parallel and cascaded manners to form a whole reasoning process. ReasonFormer is end-to-end and unified, solving multiple tasks with one model. Extensive experiments on 11 tasks reveal the compositional reasoning ability of ReasonFormer and the disentangling of its representation and reasoning modules.

Limitations
As mentioned in Sec. 2, the current selection of fundamental reasoning skills for language models is limited by the availability of well-defined tasks, clear definitions of those tasks, and sufficient training data. As a result, some skills may overlap or may not be fundamental enough; for example, the simple QA skill may overlap with the NER skill to some extent. In the future, it would be worthwhile to explore self-supervised training tasks that can inject more fundamental abilities into language models. The selection and combination of fundamental reasoning skills can also be further explored, for example by including numerical reasoning ability to solve mathematical problems. Finally, methods for skill-centric pre-training corpus construction can be explored to improve the effectiveness of these skills.

Ethics Statement
The present study was conducted in accordance with ethical principles. It involved analysis using publicly available data and knowledge sources (e.g., Wikipedia). The work therefore involved no human participants and no potential risks regarding credentials or privacy, and no ethical clearance was required.

A Example of Pre-training Tasks
For the basic question-answering skill, QA-centric pre-training uses a generation-filtering pipeline to build a semi-supervised large-scale corpus (Lewis et al., 2021; Zhong et al., 2022a): (1) use annotated QA data to train a passage-to-question-answer generator; (2) take Wikipedia passages as inputs and generate corresponding pseudo questions and answers; (3) filter the (passage, question, answer) triples with a QA model.
For the logic skill, we use the automatically constructed data from LogiGAN (Pi et al., 2022). It uses logical indicators (e.g., "therefore", "as a result") to automatically identify logical inference phenomena expressed in natural language, masks the corresponding causes/results of events, and asks the pre-trained model to recover them to learn logical reasoning ability.
For natural language inference, we use the public annotated corpora SNLI (Bowman et al., 2015) and MNLI (Williams et al., 2018). Given a premise sentence, the model is expected to predict whether the premise entails the hypothesis sentence.
For the named entity recognition skill, we use weakly annotated data (Chen et al., 2022) obtained from Wikipedia and Wikidata. The mentions are the texts with anchor links, and the types are obtained from the Wikidata "instance of" and "subclass of" properties. We design three pre-training tasks similar to Chen et al. (2022): (1) given a sentence, identify all mentions in it; (2) given a sentence and the types of interest, output all mentions with those types; (3) given a sentence and its mentions, predict all types of the mentions.
For the fact skill, we use fact triples from Wikidata and design a task that predicts the tail entity given the head entity and the relation, following Moiseev et al. (2022).
A summary of the examples for each task is presented in Fig. 4.

B Implementation Details
Pre-training Details. We use "google/t5-v1_1-base" from the HuggingFace implementation (Wolf et al., 2020) as the base model for all our experiments. We use a learning rate of 5e-5 and train all models for 5 epochs. The warmup ratio is set to 0.1. The total batch size is set to 72 for the shared model and 64 for the private model. The down-projection hidden size of the adapter is set to 256. We use 8 V100 GPUs for model training.

Figure 1 :
Figure 1: Compositional reasoning process of humans in complex decision-making. Humans solve problems via cascaded executions of fundamental skills.

Figure 3 :
Figure 3: Case study for activated-skills analysis on 3 datasets. The answer is marked in orange.
For all the full-data experiments, we use a learning rate of 1e-4 and 10 training epochs with a batch size of 48 for our models. The model is validated at the end of each epoch. For all the few-shot experiments, we use a learning rate of 1e-5 and 200 training epochs with a batch size of 8. The model is validated every 200 steps.

Figure 2: Compositional Reasoning Modules (System 2): iterative cascaded reasoning at the $i^{th}$ step.
Further details are given in § 4.2 and examples are given in Appendix A.

Table 1 :
Main results on 11 reasoning tasks. RPT indicates reasoning-centric pre-training, introduced in § 2 & § 4.2. S indicates sparse activation (top 2) of the RMs, while F denotes full activation of all RMs.
Inspired by MoE, we develop Mixture-of-Reasoning-Modules (MoRM) methods for comparison. Unlike MoE, which builds parallel experts in the FFN layers of Transformer layers, MoRM builds parallel reasoning modules (RMs) on top of the representation module and sparsely activates them. Specifically, after initialization with T5, the last 3 Transformer layers in the encoder are duplicated in parallel $N_s$ (the number of skills) times, and their outputs are weighted-averaged by the routing scores of the activated RMs. This increases the model size in a similar way to ReasonFormer, so it verifies whether the improvements come from the increased parameters. The major differences between ReasonFormer and MoRM are: (1) MoRM involves no cascaded reasoning steps (depth = 1); (2) like MoE, the RMs in MoRM are jointly trained on all instances without the skill routing loss (§ 3.3), i.e., with no expertise of the RMs. We also report the results of MoRM after reasoning-centric pre-training (RPT-MoRM).

Table 3 :
Few-shot experiments after freezing different modules. The representation module is abbreviated as rep.
complexity and composition orders. Single is an ablated version of ReasonFormer in which the reasoning modules are not cascaded (depth = 1) and the adapter is also eliminated. The comparison between the Cascaded and Single versions of ReasonFormer (Line 1 vs. Line 4) demonstrates that the cascaded reasoning mechanism brings notable improvements, revealing the effectiveness of the multi-step reasoning process.