Competing Independent Modules for Knowledge Integration and Optimization

This paper presents a neural framework of untied independent modules, used here to integrate off-the-shelf knowledge sources such as language models, lexica, POS information, and dependency relations. Each knowledge source is implemented as an independent component that can interact and share information with other knowledge sources. We report proof-of-concept experiments for several standard sentiment analysis tasks and show that the knowledge sources interoperate effectively without interference. As a second use case, we show that the proposed framework is suitable for optimizing BERT-like language models even without the help of external knowledge sources. We cast each Transformer layer as a separate module and demonstrate performance improvements from this explicit integration of the different information encoded at the different Transformer layers.


Introduction
Pre-trained language models such as BERT (Devlin et al., 2019) are trained on large corpora with unsupervised end-to-end training. Such monolithic systems cannot take advantage of extant outside grammatical or domain knowledge as complementary information.
Many language tasks benefit from knowledge sources that are known a priori. For sentiment analysis, for instance, various gazetteer lists and sentiment lexica encode sentiment words, word polarity, aspect-sentiment pairs, etc., which have proven effective as knowledge sources in different machine learning architectures (Özdemir and Bergler, 2015; Yang et al., 2019; Zhao et al., 2020; Ke et al., 2020).
Deep learning systems for sentiment analysis leverage sentiment words to enhance embedding representations by continuing the pre-training process of masked language models (Tian et al., 2020), or by re-training a modified version of the language model that has intermediate layers for explicitly encoding sentiment knowledge (Ke et al., 2020). Such knowledge integration into pre-trained models exceeds fine-tuning in computational cost and requires sophisticated calibration of the training phase. Moreover, reported approaches encode only a single type of knowledge, either lexical (Tian et al., 2020) or grammatical (Tang et al., 2020). For tasks that benefit from several types of knowledge, repeated retraining becomes prohibitive.
This paper demonstrates the feasibility of using standard pre-trained language models and incorporating external, off-the-shelf knowledge sources through dedicated, independent modules for each knowledge component. This framework reduces, to some degree, the need for ML experts to engineer features for specific tasks, since the creation of gazetteer lists and specialist lexica with domain scores is accessible to domain experts.
Modules incorporating knowledge sources are independent of one another to address major issues with combining extant knowledge with monolithic architectures: robustness, flexibility, and transparency.
Robustness Independent modules make it possible to exploit multiple, possibly inconsistent knowledge sources in parallel while reducing interference effects. Different knowledge sources that do not always agree on facts can cover a wider spectrum for the task at hand, but the inconsistency might hurt the learning process. By spreading backpropagation independently over each module, training will weigh and assess the usefulness of each module in the context of the task and the other modules. Schölkopf et al. (2012) and Goyal et al. (2019) have shown that independent modules make the overall system more robust in case of distribution shifts.
Flexibility Since the modules are independent of one another, a module can be added or removed without further adjustments and, as our ablation experiments show, with only commensurate loss. Moreover, the independent modules can be deployed at different points in the architecture. This paper demonstrates the flexibility of the approach by showing effectiveness for input-oriented modules concerned with token-level information (i.e., sentiment value or POS tag), for modules with relation information between tokens (grammatical dependencies), and for modules at the Transformer layer level, taking each Transformer layer as an independent source of information and improving performance significantly. At the Transformer layer level, modules permit the state of higher Transformer layers to influence the state of lower layers at low cost.
Transparency Module activation can be tracked. Because the modules here encode different components independently, their activation patterns can be visualized and analyzed, an important advantage, especially for domain-expert developers who might not be familiar with the development environments used by ML experts.

This paper presents a proof of concept of the interacting independent module framework on various sentiment analysis datasets. Sentiment analysis is a much-studied topic with different general sentiment lexica readily available and well-understood benefits from dependency parses and domain-specific sentiment lexica. The sentiment tasks we use span a range from simple analysis of a datapoint as a two-class classification task, to three-class classification, to relation-oriented aspect-based sentiment classification, and finally to sentiment classification of tweets expressing figurative language. This variety of task structures, over which we can use the same knowledge sources, makes the results comparable and showcases the flexibility and robustness of the modules.

Related literature
Neural modular design has been a topic of interest for more than three decades (Bottou and Gallinari, 1991; Jacobs et al., 1991; Ronco et al., 1996; Reed and De Freitas, 2016). Most proposed models assume that only one expert is active at a particular time-step, but EntNet (Henaff et al., 2017) and IndRNN (Li et al., 2018) are proposals for sets of separate recurrent models, offering module independence but no communication between modules. The recently proposed Recurrent Independent Mechanisms (RIMs) (Goyal et al., 2019), however, model a complex system by dividing the overall model into M communicating recurrent modules. The RIMs architecture was introduced for visual input. The independent mechanisms operate on the same input and do not have access to external information; rather, each module specializes on a simpler problem by focusing on different parts of the input.
Attempts at importing outside knowledge into neural architectures for language tasks have experimented with stacking (bi-)LSTMs (Søgaard and Goldberg, 2016), where each layer of (bi-)LSTMs could be interpreted as a different module, but with no independence and only one-way communication. For Transformer architectures, adapters form a kind of module (Houlsby et al., 2019; Pfeiffer et al., 2020). Adapters are trainable modules that can be interspersed between attention layers of frozen language models to provide a boost by learning task-specific representations.
The current proposal draws on several of these previous systems for a comprehensive architecture, where the independent modules can take different encodings as input.

The proposed framework
Token level knowledge sources Suppose there are N knowledge sources (n = 1, . . . , N) available. The annotations provided by the n-th knowledge source are encoded by an embedder E_n. Formally, E_n produces a sequence x^n_1, x^n_2, . . . , x^n_T (e.g., a token embedding sequence, a gazetteer lookup sequence, etc.).

Recurrent modules
The output sequence of each embedder E_n is used as input to a recurrent module R_n (n = 1, . . . , N). The modules are independent in their dynamics, and each can be implemented by any recurrent model, chosen independently, such as simple RNNs (Elman, 1990), GRUs (Cho et al., 2014), LSTMs (Hochreiter and Schmidhuber, 1997), Graph LSTMs (Peng et al., 2017), etc. We associate two hidden states with each module R_n at time-step t: a temporary hidden state h̃^n_t ∈ R^{d_h} and an actual hidden state h^n_t ∈ R^{d_h}.
Controller A controller component C, in this paper an LSTM, schedules read operations. At time-step t, the controller has hidden state z_t ∈ R^{d_cont} and attends to the hidden states of all modules at t − 1 and to position t of all N input sequences, where z_{t−1} is the previous hidden state of the controller, Q_{t−1} = z_{t−1} W^{query}, W^{query} ∈ R^{d_cont × d_query} is the linear transformation for constructing the query, and W^{key}_n ∈ R^{(d_h + d_in) × d_key} and W^{val}_n ∈ R^{(d_h + d_in) × d_val} are the linear transformations for constructing keys and values in the attention mechanism (Vaswani et al., 2017).

Figure 1: An illustration of the proposed framework with 3 recurrent modules (R_1–R_3) and k = 2. Each module processes a knowledge source: language model → R_1, POS → R_2, and sentiment lexicon → R_3.
Note that the hidden state of the controller, z_t, is used to construct the query Q_t to attend to the input sequences at the next time-step. In Equation 2, the softmax produces N attention scores, one per module. The top k modules form a subset S_t of recurrent modules that are active and thus will be updated by their respective inputs. Inactive modules output their input unchanged.
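The controller's selection step can be sketched in a few lines of plain Python. This is a minimal illustration with per-module scores collapsed to scalars; the names `softmax` and `select_top_k` are ours, not the paper's, and the real model computes the scores via the query/key attention described above.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def select_top_k(scores, k):
    """Return the indices of the k modules with the highest attention
    weight. These modules form the active set S_t; all other modules
    pass their state through unchanged at this time-step."""
    weights = softmax(scores)
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return set(ranked[:k]), weights

# Toy example: 3 modules, k = 2 (the configuration shown in Figure 1).
active, weights = select_top_k([2.0, 0.5, 1.5], k=2)
```

Because softmax is monotonic, the top-k selection could equally be done on the raw scores; the weights are still needed for the weighted read of module values.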
Updating recurrent modules All active modules produce a temporary hidden state h̃^n_t = R_n(h^n_{t−1}, x^n_t).

Interaction The module R_n attends to all other modules, with dedicated linear transformations constructing the key and value for the interaction attention mechanism (Eq. 4). An illustration of the proposed model is provided in Figure 1. Active modules are indicated in dark gray. At each time-step the controller determines the set of top k active modules (k = 2 in the figure) by attending to the inputs as well as all hidden states.
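The update-and-interact step above can be sketched as follows. This is a toy with scalar hidden states; the `rnn_cell` and `interact` stand-ins below are our simplifications of the actual LSTM/RNN cells and the softmax interaction attention, chosen only to make the control flow concrete.

```python
def step(modules_prev_h, inputs, active, rnn_cell, interact):
    """One time-step update (simplified, scalar hidden states).

    Active modules compute a temporary hidden state from their own input,
    then refine it by attending to every other module's previous state.
    Inactive modules carry their previous hidden state forward unchanged.
    """
    new_h = []
    for n, h_prev in enumerate(modules_prev_h):
        if n in active:
            h_tmp = rnn_cell(h_prev, inputs[n])            # temporary hidden state
            others = [h for m, h in enumerate(modules_prev_h) if m != n]
            new_h.append(h_tmp + interact(h_tmp, others))  # interaction attention
        else:
            new_h.append(h_prev)                           # state passes through
    return new_h

# Toy stand-ins for the recurrent cell and the interaction attention
# (not the paper's equations):
rnn_cell = lambda h, x: 0.5 * h + x
interact = lambda h, others: 0.1 * sum(others) / max(len(others), 1)

h = step([1.0, 2.0, 3.0], inputs=[0.1, 0.2, 0.3], active={0, 2},
         rnn_cell=rnn_cell, interact=interact)
```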
SE17-4A SemEval 2017 task 4 subtask A for 3-class sentiment classification of tweets into negative, neutral, and positive classes (Rosenthal et al., 2017).
SE14-L SemEval 2014 task 4 for aspect-based sentiment analysis of online reviews of laptops; a relation extraction and classification task.
SE14-R SemEval 2014 task 4 for aspect-based sentiment analysis of online reviews of restaurants; a relation extraction and classification task.

Experiments
Our experiments divide into two sets: first, we show through ablation that importing external knowledge with interacting independent modules is effective for all tasks and that the modules do not interfere with each other. A second set of experiments makes each Transformer layer in two BERT-like models an independent, interacting module and shows improved performance.

External knowledge sources
The most widely used external knowledge stems from pre-trained word embeddings. All experimental runs contain a module for token embeddings. We compare BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019).
Task-oriented knowledge sources for the sentiment tasks include six sentiment lexica. We use three general sentiment lexica, which range from small to very large and from manually to automatically derived (AFINN (Nielsen, 2011), MPQA (Wilson et al., 2005), NRC HashTag Sentiment (Mohammad et al., 2013)), as well as NRC EmoLex (Mohammad and Turney, 2013). We also use two aspect-specific lexica for the restaurant and laptop domains (Yelp (Kiritchenko et al., 2014) and LapTop (Kiritchenko et al., 2014)).
Part-of-speech (POS) tags are the most widely used grammatical feature and are available from many standard NLP environments. We use ANNIE for POS tagging (Cunningham et al., 2002). Prior work has demonstrated the efficacy of dependency relations, especially for aspect-based sentiment analysis. We use the Stanford Parser (Klein and Manning, 2003; de Marneffe et al., 2006) for extracting dependency relations.

Implementation of modules
Here, an embedder E_n is either a pre-trained language model or a learnable embedding layer. For the token at position t, E_n emits its knowledge representation x^n_t ∈ R^{d^n_in}, which is then used as input to the recurrent layer R_n, which can be an LSTM, GRU, simple RNN, etc.

Sentiment: The AFINN, NRC HashTag Sentiment, Yelp, and LapTop lexica return numeric sentiment scores or polarities and can be used directly as input for their dedicated modules. The MPQA polarities Negative, Neutral, and Positive are encoded as −1, 0, and +1. NRC EmoLex returns eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive), which map directly onto an n-hot vector representation. All sentiment score representations form the input to bi-RNNs.
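The lexicon encodings above are simple enough to sketch directly. A minimal version, assuming a particular ordering of the EmoLex categories (the order inside the n-hot vector is our assumption, not specified in the text):

```python
# MPQA polarity labels mapped to scalar inputs, as described above.
MPQA_POLARITY = {"negative": -1, "neutral": 0, "positive": 1}

# The eight basic emotions plus the two sentiments returned by NRC EmoLex.
# The ordering here is a hypothetical choice for illustration.
EMOLEX_CATEGORIES = [
    "anger", "fear", "anticipation", "trust", "surprise",
    "sadness", "joy", "disgust", "negative", "positive",
]

def emolex_nhot(categories):
    """Map a token's EmoLex annotations to a 10-dimensional n-hot vector."""
    present = set(categories)
    return [1 if c in present else 0 for c in EMOLEX_CATEGORIES]

# A token annotated with "joy" and "positive":
vec = emolex_nhot(["joy", "positive"])
```

These per-token vectors then form the input sequence to the lexicon's dedicated bi-RNN module.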

Dependencies:
We use the Graph-LSTM model proposed by Peng et al. (2017) to encode dependency information because its recurrent dynamics fit easily into the framework.
The Graph-LSTM model encodes dependency relations using a bi-directional recurrent architecture, where the forward pass encodes all of the dependencies from a dependency parse where the dependent follows the governor, and the backward pass encodes those dependencies, where the dependent precedes the governor in the input sequence (see Figure 2). At time-step t the input to the recurrent module is the token at position t as well as the hidden states at all time points corresponding to its governors. Note that for high-dimensional and complex inputs such as word embeddings and POS embeddings, we used LSTMs. In order to keep the model light-weight, however, we used simple RNNs for the simpler encodings of sentiment lexica.
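The forward pass of this dependency-aware recurrence can be sketched as follows. This is a toy with scalar states, and the averaging update is our stand-in for the actual Graph-LSTM gates; the point is only that each position's update sees the hidden states of its governors that occur earlier in the sequence (the backward pass mirrors this for governors occurring later).

```python
def forward_pass(inputs, governors):
    """Forward dependency pass (simplified, scalar hidden states).

    inputs:    one input value per token position.
    governors: governors[t] is the list of positions governing token t
               in the dependency parse.
    Only governors at earlier positions contribute in the forward pass.
    """
    hidden = []
    for t, x in enumerate(inputs):
        prev = [hidden[g] for g in governors[t] if g < t]
        context = sum(prev) / len(prev) if prev else 0.0
        hidden.append(0.5 * context + x)   # stand-in for the LSTM cell update
    return hidden

# 3 tokens; token 1 is governed by token 0, token 2 by tokens 0 and 1.
h = forward_pass([1.0, 2.0, 3.0], governors=[[], [0], [0, 1]])
```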
The models are implemented using PyTorch (Paszke et al., 2017). We use cross-entropy as the classification loss and optimize the models with the Adam optimizer (Kingma and Ba, 2015); the set of hyper-parameters is provided in Figure 3.

Results
We conduct ablation experiments for the different knowledge modules over all tasks, distinguishing the case where all modules are active (Figure 4) from the case where only the top k modules are active (Figure 5). To highlight the potential of modules to compensate for the loss of fine-tuning, we report the performance of frozen language models.
Baselines We report the performance of BERT and RoBERTa with no additional modules for comparison. Within our framework, the baseline case is when token embeddings are the only input (N = 1) and the model reduces to a simple bi-LSTM over BERT or RoBERTa embeddings. Figure 4 suggests that this baseline is at least equivalent to the fine-tuned language models, as it never underperforms them.
All modules active Figure 4 shows that all runs consistently benefit from each of the sentiment lexica. Moreover, the differences between the lexica are consistently, surprisingly small. As expected, the HashTag Sentiment lexicon shows slightly greater improvements for the tweet datasets SE17-4 and SE15-11, and the two specialized lexica LapTop and Yelp show slightly greater improvements on SE14-L and SE14-R. This pattern suggests that the system has properly attended to the different lexica. Grammatical knowledge from POS and dependency relations also provides greater improvements for the aspect-based tasks, confirming the importance of grammatical knowledge for relation extraction. While each grammatical feature on its own yields only marginal improvements, their combination yields consistently better results, more notably for BERT-based settings and most significantly for the relation tasks SE14-L and SE14-R. Moreover, the two grammatical knowledge sources never lower performance.
A consistent observation across all settings is that the modules combine without loss in performance, and best results are consistently achieved when all modules are implemented (N = 9). Note that when all modules are active, no controller component is needed. Figure 4 also shows results for the frozen language models. For all five tasks and both language models, the full set of nine modules fully compensates for fine-tuning and even slightly increases performance above the fine-tuned baseline. Freezing language models can prevent over-fitting on small data sets. When language models are frozen in the Adapter framework (Pfeiffer et al., 2020), the Adapter modules become responsible for learning the inductive bias. In our framework, in contrast, learning is facilitated by extant domain knowledge at a much reduced parameter space.

Top k modules active When the set of active modules is limited to the top k, the modules compete with each other for active status (Figure 5). Interestingly, for all tasks, limiting the number of modules yields better performance and confirms the importance of sparse activation of the modules. For fine-tuned models, the best performer varies between 6, 7, and 8 active modules. The differences are very small and thus merely suggestive. Interestingly, however, for the frozen language models, k = 7 is the most frequent best performer, with the restaurant domain being an exception. The figurative language task SE15-11 is the only task for which the frozen BERT and RoBERTa models differ in this respect.
Allowing only a set of top modules to be active resembles hard attention with two major differences: there is no need to apply a threshold value to the attention scores here (the threshold is the number of active modules k) and activity/inactivity is determined based on competition among modules in our framework.
Competition between modules fosters independence among learned mechanisms, making each module specialize on a simpler aspect of the problem (Goyal et al., 2019; Parascandolo et al., 2018). Here, we demonstrate system behavior by varying the number of active modules (k) manually. The k values for best-performing settings fall within a narrow interval, suggesting that automatic mechanisms could determine k during training.

Integrating Transformer layers
The smallest version of BERT consists of 12 layers of Transformer encoders. Jawahar et al. (2019), Tenney et al. (2019), and de Vries et al. (2020) all argue that layers in BERT-style models encode different information. For instance, Jawahar et al. (2019) claim that phrase-level information is encoded in lower layers of BERT and intermediate layers encode linguistic information, with surface features at the bottom and syntactic features in the middle. Tenney et al. (2019) also demonstrate that lower layers in BERT provide a richer representation for POS tagging, concluding that the low-level layers implicitly encode POS information. For most tasks, however, only the last layer of BERT-like language models is used to make predictions.

We demonstrate that the framework of independent, interacting modules, while useful for incorporating external knowledge sources into a neural architecture, is more generally beneficial. We encode each layer of two BERT-like language models in separate modules, thus enabling lower layers to have access to the representations of higher layers (bi-directional flow of information) and vice versa. We hypothesize that the framework can effectively combine all layers and yield improvements especially for tasks where different levels of representation are required, such as relation detection.

Figure 6 shows performance across the tasks when the representations provided by the Transformer layers are integrated in our independent modules. Integrating all 12 layers yields consistent improvements across all tasks when compared to the output of layer 12 alone. This demonstrates the potential of bi-directional flow of information and the self-awareness of all intermediate layers. Figure 7 shows results for running all 12 modules, but limiting activity to the top k. Only three different values for k are shown: 1, 6, and 12.
Consistently, for all tasks and for both fine-tuned and frozen models, k = 6 shows top performance, confirming the previous observation that competition increases performance.
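A top-k combination of layer modules can be sketched as a masked, renormalized attention over the per-layer representations. The function name `combine_layers` and the scalar controller scores are ours; in the actual framework the scores come from the controller's attention mechanism.

```python
import math

def combine_layers(layer_vectors, scores, k):
    """Weighted combination of the k highest-scoring layer representations.

    layer_vectors: one vector per Transformer-layer module.
    scores:        one raw controller score per layer.
    Inactive layers receive weight 0; the weights over the k active
    layers are renormalized to sum to 1.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    active = set(ranked[:k])
    exps = [math.exp(s) if i in active else 0.0 for i, s in enumerate(scores)]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(layer_vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, layer_vectors))
            for d in range(dim)]

# 3 toy "layers" with 2-dimensional representations, k = 2:
combined = combine_layers([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
                          scores=[2.0, 0.0, 1.0], k=2)
```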

Results
The difference between the first rows of Figures 6 and 7 is instructive: the first row of Figure 6 shows the performance of a single module with input from Transformer layer 12, while Figure 7 shows 12 modules with a limit of k = 1. Almost always, the 12 modules that select a top k achieve a slight improvement, which must be due to the interaction: while non-active modules do not update their hidden state, they retain the hidden state of the previous time-step. The active module can inspect these hidden states and thus potentially gain information.
To gain an overview over all layers and all tasks, Figure 8 shows the percentage of time-steps in which the independent modules for the different Transformer layers have been active. For all tasks, the last layer is the most active. This is not surprising, since this layer is the target when pre-training language models. Interestingly, the first two layers also consistently demonstrate high activity for all tasks. We surmise that this may be because the sentiment tasks are token-oriented and the first two layers might capture lexical triggers. The pronounced spike in activity for layer 7 for the aspect-based tasks SE14-R and SE14-L might, likewise, confirm the conjecture by Jawahar et al. (2019) that intermediate layers encode grammatical relations, here possibly dependency relations.
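The activation statistics behind such a figure are straightforward to compute from a log of the controller's per-time-step active sets. A minimal sketch (the name `activation_percentages` is ours):

```python
from collections import Counter

def activation_percentages(active_log, num_modules):
    """active_log: one set of active module indices per time-step, as
    produced by the controller's top-k selection. Returns, per module,
    the percentage of time-steps in which it was active."""
    counts = Counter(m for step in active_log for m in step)
    total = len(active_log)
    return [100.0 * counts[m] / total for m in range(num_modules)]

# Toy log: 4 time-steps, 3 layer modules, k = 2 active per step.
pct = activation_percentages([{0, 2}, {0, 1}, {0, 2}, {1, 2}], num_modules=3)
```

Tracking this quantity per task is what makes the module behavior transparent: spikes for particular layers can be inspected and related to what those layers are believed to encode.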

Discussion
The reported experiments demonstrate the capabilities of the competing independent modules both for leveraging external knowledge and integrating Transformer layers of BERT-like models. In both cases, the integration leads to improvements.
Integration of layers is a strong competitor to independent modules that leverage external knowledge. This suggests that the required knowledge is already encoded in the language models to some extent. The question is: which approach is preferable? The embedding dimensions for external knowledge sources such as sentiment lexica and POS are very small. This leads to a small number of trainable parameters in the independent modules that encode these knowledge sources. When integrating layers, since the input dimension for all independent modules is the same (768 in our experiments), the number of trainable parameters is significantly larger than in the case of external knowledge sources. The choice between the two options depends on the target task and the availability of task-specific knowledge sources. When these resources are available, the reduction in development and processing effort becomes very attractive, especially for small datasets.

Conclusions
This paper presents a proof of concept for integrating external knowledge in competing, interacting independent modules. The reported experiments show consistent improvements when using readily available, off-the-shelf knowledge sources such as sentiment lexica, POS, and dependency relations encoded in independent modules. This is true even for knowledge sources that contradict each other's information, showing the robustness of the approach. When modules are in competition mode, further improvements can be achieved.
Experiments with two frozen language models demonstrate that task-specific knowledge sources in this architecture more than compensate for fine-tuning of the language model, with a significant reduction in the number of trainable parameters.
We also show that the proposed framework is suitable for integrating the Transformer layers of Transformer-based language models by allowing lower layers to have access to the representations of higher layers, i.e., both bottom-up and top-down flow of information.
Moreover, the behavior of the independent modules can be visualized and the contribution of each module can be measured.
In summary, interacting independent modules form a framework that enables computationally restrained task adaptation with off-the-shelf external resources in a transparent fashion.