CogAlign: Learning to Align Textual Neural Representations to Cognitive Language Processing Signals

Most previous studies integrate cognitive language processing signals (e.g., eye-tracking or EEG data) into neural models of natural language processing (NLP) by directly concatenating word embeddings with cognitive features, ignoring the gap between the two modalities (i.e., textual vs. cognitive) and the noise in cognitive features. In this paper, we propose CogAlign, an approach that addresses these issues by learning to align textual neural representations to cognitive features. In CogAlign, we use a shared encoder equipped with a modality discriminator to alternately encode textual and cognitive inputs and capture their differences and commonalities. Additionally, a text-aware attention mechanism is proposed to detect task-related information and to avoid using noise in cognitive features. Experimental results on three NLP tasks, namely named entity recognition, sentiment analysis and relation extraction, show that CogAlign achieves significant improvements with multiple cognitive features over state-of-the-art models on public datasets. Moreover, our model is able to transfer cognitive information to other datasets that do not have any cognitive processing signals.


Introduction
Cognitive neuroscience, from the perspective of language processing, studies the biological and cognitive processes that underlie language processing in the human brain, while natural language processing (NLP) teaches machines to read, analyze, translate and generate human language sequences (Muttenthaler et al., 2020). The commonality of language processing shared by these two areas forms the basis of cognitively-inspired NLP, which uses cognitive language processing signals generated by human brains to enhance or probe neural models in solving a variety of NLP tasks, such as sentiment analysis (Mishra et al., 2017; Barrett et al., 2018), named entity recognition (NER) (Hollenstein and Zhang, 2019), dependency parsing (Strzyz et al., 2019), and relation extraction (Hollenstein et al., 2019a). In spite of the success of cognitively-inspired NLP in some tasks, there are some issues in the use of cognitive features in NLP. First, for the integration of cognitive processing signals into neural models of NLP tasks, most previous studies have directly concatenated word embeddings with cognitive features from eye-tracking or EEG, ignoring the large differences between these two types of representations. Word embeddings are usually learned as static or contextualized representations of words in large-scale spoken or written texts generated by humans. In contrast, cognitive language processing signals are collected by specific medical equipment, which records the activity of human brains during the cognitive process of language processing. These cognitive processing signals are usually assumed to represent psycholinguistic information (Mathias et al., 2020) or cognitive load (Antonenko et al., 2010). Intuitively, information in these two types of features (i.e., word embeddings and cognitive features) is not directly comparable. As a result, directly concatenating them may be suboptimal for neural models solving NLP tasks.
The second issue with the incorporation of cognitive processing signals into neural models of NLP is that not all information in cognitive processing signals is useful for NLP. The recorded signals contain information covering a wide variety of cognitive processes, particularly for EEG (Williams et al., 2019;Eugster et al., 2014). For different tasks, we may need to detect elements in the recorded signals, which are closely related to specific NLP tasks, and neglect features that are noisy to the tasks.
In order to address the two issues, we propose CogAlign, a multi-task neural network that learns to align neural representations of texts to cognitive processing signals, for several NLP tasks. As shown in Figure 1, instead of simply concatenating cognitive features with word embeddings, we use two private encoders to separately encode cognitive processing signals and word embeddings. The two encoders learn task-specific representations for cognitive and textual inputs in two disentangled spaces. To align the representations of the neural network with cognitive processing signals, we further introduce an additional encoder that is shared by both data sources. We alternately feed cognitive and textual inputs into the shared encoder and force it to minimize an adversarial loss of the discriminator stacked over the shared encoder. The discriminator is task-agnostic so that it can focus on learning both differences and deep commonalities between neural representations of cognitive and textual features in the shared encoder. We want the shared encoder to be able to transfer knowledge of cognitive language processing signals to other datasets even if cognitive processing signals are not available for those datasets. Therefore, CogAlign does not require cognitive processing signals as inputs during inference.
Partially inspired by the attentive pooling network (Santos et al., 2016), we propose a text-aware attention mechanism to further align textual inputs and cognitive processing signals at the word level.
The attention network learns a compatibility matrix of textual inputs to cognitive processing signals. The learned text-aware representations of cognitive processing signals also help the model to detect task-related information and to avoid using other noisy information contained in cognitive processing signals.
In a nutshell, our contributions are as follows:
• We present CogAlign, which learns to align neural representations of natural language to cognitive processing signals at both the word and sentence level. Our analyses show that it can learn task-related cognitive processing signals.
• We propose a text-aware attention mechanism that extracts useful cognitive information via a compatibility matrix.
• With the adversarially trained shared encoder, CogAlign is capable of transferring cognitive knowledge to other datasets for the same task, where no recorded cognitive processing signals are available.
• We conduct experiments on incorporating eye-tracking and EEG signals into three NLP tasks (NER, sentiment analysis and relation extraction), which show that CogAlign achieves new state-of-the-art results and significant improvements over strong baselines.

Related Work
Eye-tracking for NLP. Numerous studies in neuroscience have shown that eye-tracking data are associated with language comprehension activity in the human brain (Rayner, 1998; Henderson and Ferreira, 1993). In cognitively motivated NLP, several studies have investigated the impact of eye-tracking data on NLP tasks. In early work, these signals were used in machine learning approaches to NLP tasks such as part-of-speech tagging (Barrett et al., 2016), multiword expression extraction (Rohanian et al., 2017), and syntactic category prediction (Barrett and Søgaard, 2015). In neural models, eye-tracking data are combined with word embeddings to improve various NLP tasks, such as sentiment analysis (Mishra et al., 2017) and NER (Hollenstein and Zhang, 2019). Eye-tracking data have also been used to enhance or constrain neural attention (Barrett et al., 2018; Sood et al., 2020b,a; Takmaz et al., 2020).
EEG for NLP. Electroencephalography (EEG) measures potential fluctuations caused by the activity of neurons in the cerebral cortex. The exploration of EEG data in NLP tasks is relatively limited. Chen et al. (2012) improve the performance of automatic speech recognition (ASR) by using EEG signals to classify the speaker's mental state. Hollenstein et al. (2019a) incorporate EEG signals into NLP tasks, including NER, relation extraction and sentiment analysis. Additionally, Muttenthaler et al. (2020) leverage EEG features to regularize attention for relation extraction.
Adversarial Learning. The concept of adversarial training originates from Generative Adversarial Nets (GAN) (Goodfellow et al., 2014) in computer vision. Since then, it has also been applied in NLP (Denton et al., 2015; Ganin et al., 2016). Recently, a great variety of studies have attempted to introduce adversarial training into multi-task learning for NLP tasks, such as Chinese NER (Cao et al., 2018), crowdsourcing learning, and cross-lingual transfer learning (Kim et al., 2017), to name just a few. Different from these studies, we use adversarial learning to deeply align the cognitive modality to the textual modality at the sentence level.

CogAlign
CogAlign is a general framework for incorporating cognitive processing signals into various NLP tasks. The target task can be specified at the predictor layer with a corresponding task-specific neural network. CogAlign focuses on aligning cognitive processing signals to textual features at the word and encoder level. The text-aware attention aims at learning task-related cognitive information (thus filtering out noise), while the shared encoder and discriminator collectively learn to align representations of cognitive processing signals to those of textual inputs in a unified semantic space. The matched neural representations can be transferred to other datasets of the target task even when cognitive processing signals are not present. The neural architecture of CogAlign is visualized in Figure 1. We elaborate on the components of the model in the following subsections.

Input Layer
The inputs to our model include textual word embeddings and cognitive processing signals.
Word Embeddings. For a given word $x_i$ from the dataset of a target NLP task (e.g., NER), we obtain the vector representation $h^{word}_i$ by looking up a pre-trained embedding matrix. The obtained word embeddings are fixed during training. For NER, previous studies have shown that character-level features can improve the performance of sequence labeling. We therefore apply a character-level CNN framework (Chiu and Nichols, 2016; Ma and Hovy, 2016) to capture the character-level embedding. The word representation of word $x_i$ in the NER task is the concatenation of the word embedding and the character-level embedding.
Cognitive Processing Signals. For cognitive inputs, we obtain word-level eye-tracking and EEG signals via data preprocessing (see details in Section 5.1). Thus, for each word $x_i$, we employ two cognitive processing signals, $h^{eye}_i$ and $h^{eeg}_i$. The cognitive input $h^{cog}_i$ can be either a single type of signal or a concatenation of different cognitive processing signals.

Text-Aware Attention
As not all information contained in cognitive processing signals is useful for the target NLP task, we propose a text-aware attention mechanism to assign text-sensitive weights to cognitive processing signals. The attention mechanism consists of learning a compatibility matrix between the word embeddings $H^{word} \in \mathbb{R}^{d_w \times N}$ and the cognitive representations $H^{cog} \in \mathbb{R}^{d_c \times N}$ from the input layer, and performing a cognitive-wise max-pooling operation over this matrix. The compatibility matrix $G \in \mathbb{R}^{d_w \times d_c}$ is computed from $H^{word}$, $H^{cog}$ and a trainable parameter matrix $U \in \mathbb{R}^{N \times N}$, where $d_w$ and $d_c$ are the dimensions of the word embeddings and cognitive representations, respectively, and $N$ is the length of the input. We then obtain a vector $g^{cog} \in \mathbb{R}^{d_c}$, which gives an importance score for each element of the cognitive processing signals with regard to the word embeddings, by row-wise max-pooling over $G$. Finally, we compute attention weights from $g^{cog}$ and use them to obtain the text-aware representation of the cognitive processing signals.
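The attention steps described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the exact formulation: it assumes the bilinear form $G = \tanh(H^{word} U (H^{cog})^\top)$ used in attentive pooling networks (Santos et al., 2016), and it assumes that the attention weights are a softmax over $g^{cog}$ that rescales the rows of $H^{cog}$. All function names are hypothetical.

```python
import math

def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(v):
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def text_aware_attention(h_word, h_cog, u):
    """h_word: d_w x N, h_cog: d_c x N, u: N x N (trainable in a real model).

    Returns the attention weights (length d_c) and the reweighted h_cog.
    """
    # Assumed compatibility matrix: G = tanh(H_word U H_cog^T), shape d_w x d_c.
    raw = matmul(matmul(h_word, u), transpose(h_cog))
    g = [[math.tanh(x) for x in row] for row in raw]
    # Max over the d_w rows for each cognitive feature: one score per feature.
    g_cog = [max(col) for col in zip(*g)]
    # Assumed weighting: softmax over the scores, then rescale each feature row.
    alpha = softmax(g_cog)
    attended = [[a * x for x in row] for a, row in zip(alpha, h_cog)]
    return alpha, attended
```

With $d_w = 2$, $d_c = 3$ and $N = 2$, the function returns three weights that sum to one and a $3 \times 2$ matrix of reweighted cognitive features.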

Encoder Layer
We adopt Bi-LSTMs to encode both cognitive and textual inputs following previous works (Hollenstein and Zhang, 2019; Hollenstein et al., 2019a).
In this work, we employ two private Bi-LSTMs and one shared Bi-LSTM as shown in Figure 1, where private Bi-LSTMs are used to encode cognitive and textual inputs respectively and the shared Bi-LSTM is used for learning shared semantics of both types of inputs. We concatenate the outputs of private Bi-LSTMs and shared Bi-LSTM as input to the task-specific predictors of subsequent NLP tasks. The hidden states of the shared Bi-LSTM are also fed into the discriminator.

Modality Discriminator
We alternately feed cognitive and textual inputs into the shared Bi-LSTM encoder. Our goal is for the shared encoder to map the representations of the two input sources into the same semantic space, so as to learn the deep commonalities of the two modalities (cognitive and textual). To this end, we use a self-supervised discriminator to provide supervision for training the shared encoder. Specifically, the discriminator acts as a classifier that categorizes each alternately fed input as either textual or cognitive. For the hidden states of modality $k$, we first use a self-attention mechanism with trainable parameters to reduce the dimension of the output of the shared Bi-LSTM, $H^s_k \in \mathbb{R}^{d_h \times N}$, yielding $h^s_k$, the output of the self-attention mechanism. We then predict the category of the input with a softmax function, where $D(h^s_k)$ is the probability that the shared encoder is encoding an input with modality $k$.
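The two discriminator steps (self-attention pooling, then a softmax classifier) can be sketched as follows. The additive scoring function `v . tanh(W h + b)` and the linear classifier are assumptions chosen for illustration, since the exact parameterization is not given here; all names are hypothetical. The hidden states are stored as a list of $N$ vectors of size $d_h$ rather than as a $d_h \times N$ matrix.

```python
import math

def softmax(v):
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(states, w, b, v):
    """Reduce N hidden states (each of size d_h) to one d_h vector.

    Assumed scoring: score_i = v . tanh(W h_i + b); weights = softmax(scores).
    """
    scores = []
    for h in states:
        hidden = [math.tanh(sum(wi * hi for wi, hi in zip(row, h)) + bi)
                  for row, bi in zip(w, b)]
        scores.append(sum(vi * x for vi, x in zip(v, hidden)))
    alpha = softmax(scores)
    d_h = len(states[0])
    return [sum(a * h[j] for a, h in zip(alpha, states)) for j in range(d_h)]

def discriminate(pooled, w_d, b_d):
    """Return [p(textual), p(cognitive)] from an assumed linear layer + softmax."""
    logits = [sum(wi * x for wi, x in zip(row, pooled)) + bi
              for row, bi in zip(w_d, b_d)]
    return softmax(logits)
```

In a trained model the parameters `w`, `b`, `v`, `w_d` and `b_d` would be learned jointly with the shared encoder.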

Predictor Layer
Given a sample $X$, the final cognitively augmented representation $H$ after the encoder layer is formed by concatenating the outputs of the private and shared Bi-LSTMs. For sequence labeling tasks like NER, we employ a conditional random field (CRF) (Lafferty et al., 2001) as the predictor, since Bi-LSTM-CRF is widely used in sequence labeling tasks (Ma and Hovy, 2016; Luo et al., 2018) for its strong performance, including in cognitively-inspired NLP (Hollenstein and Zhang, 2019; Hollenstein et al., 2019a). We first project the feature representation $H$ onto another space whose dimension equals the number of NER tags. We then compute the score of a predicted tag sequence $y$ for the given sample $X$, using a transition score matrix $T$ that scores the transition between two successive labels. Sentiment analysis and relation extraction can be regarded as multi-class classification tasks, with 3 and 11 classes, respectively. For these two tasks, we use a self-attention mechanism to reduce the dimension of $H$ and obtain the probability of a predicted class via the softmax function.
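For the NER predictor, the projection and sequence score can be illustrated with the standard linear-chain CRF form: the score of a tag sequence is the sum of per-token emission scores plus the transition scores between successive tags. This is a sketch of that standard form under the assumption that no separate start or end transitions are used; the function names are hypothetical.

```python
def project(features, w, b):
    """Map N feature vectors (size d) to N x num_tags emission scores."""
    return [[sum(wi * x for wi, x in zip(row, f)) + bi for row, bi in zip(w, b)]
            for f in features]

def sequence_score(emissions, transitions, tags):
    """Emission score of each chosen tag plus transition scores between neighbours."""
    score = sum(emissions[i][t] for i, t in enumerate(tags))
    score += sum(transitions[prev][cur] for prev, cur in zip(tags, tags[1:]))
    return score

# Toy example with 3 tokens and 2 tags.
emissions = [[1.0, 0.0], [0.5, 2.0], [0.0, 1.0]]
transitions = [[0.1, -0.2], [0.3, 0.4]]
example_score = sequence_score(emissions, transitions, [0, 1, 1])
```

For the toy example, the emissions contribute 1.0 + 2.0 + 1.0 and the transitions contribute -0.2 + 0.4, so the score is approximately 4.2.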

Adversarial Learning
In order to learn the deep interaction between cognitive and textual modalities in the same semantic space, we want the shared Bi-LSTM encoder to output representations that can fool the discriminator. We therefore adopt an adversarial learning strategy. Specifically, the shared encoder acts as the generator, which tries to align the textual and cognitive modalities as closely as possible so as to mislead the discriminator. The shared encoder and the discriminator thus work in an adversarial way.
Additionally, to make it even harder for the discriminator to distinguish the modalities, we add a gradient reversal layer (GRL) (Ganin and Lempitsky, 2015) between the encoder layer and the predictor layer. The gradient reversal layer does nothing in the forward pass but reverses the gradients and passes them to the preceding layer during the backward pass. That is, the gradients with respect to the adversarial loss, $\partial L_{Adv} / \partial \theta$, are replaced with $-\partial L_{Adv} / \partial \theta$ after going through the GRL.
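The behaviour of the gradient reversal layer can be shown with a framework-free toy class. This is only a sketch of the idea: a real model would implement it as a custom autograd operation in a deep learning framework, and the `scale` factor is an added assumption (with `scale=1.0` it reduces to the plain sign flip described above).

```python
class GradientReversal:
    """Identity in the forward pass; negates (and optionally scales) gradients backward."""

    def __init__(self, scale=1.0):
        self.scale = scale  # hypothetical knob, not part of the description above

    def forward(self, x):
        # The forward pass leaves the activations unchanged.
        return list(x)

    def backward(self, grad_output):
        # The backward pass hands the preceding layer the reversed gradient.
        return [-self.scale * g for g in grad_output]

grl = GradientReversal()
activations = grl.forward([0.3, -1.2])
reversed_grads = grl.backward([0.5, -2.0])
```

Here `activations` equals the input, while `reversed_grads` is `[-0.5, 2.0]`.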

Training Objective
CogAlign is built on a multi-task learning framework, where the final training objective is composed of the adversarial loss $L_{Adv}$ and the loss of the target task $L_{Task}$. For NER, we use the negative log-likelihood objective as the loss function. Given $T$ training examples $(X_i; y_i)$, $L_{Task}$ is defined as follows:
$$L_{Task} = -\sum_{i=1}^{T} \log p(y_i \mid X_i)$$
where $y$ denotes the ground-truth tag sequence. The probability of $y$ is computed by the softmax function:
$$p(y \mid X) = \frac{e^{score(X, y)}}{\sum_{\tilde{y} \in Y} e^{score(X, \tilde{y})}}$$
For the sentiment analysis and relation extraction tasks, the task objective is similar to that of NER. The only difference is that the label changes from a tag sequence to a single class.
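The softmax over tag sequences can be made concrete by brute force. The sketch below enumerates every possible tag sequence to compute the normalizer, which is only feasible for tiny examples; practical CRF implementations use the forward algorithm instead. It assumes the linear-chain score (emissions plus transitions) and uses hypothetical function names.

```python
import math
from itertools import product

def sequence_score(emissions, transitions, tags):
    """Assumed linear-chain score: emissions plus transitions between neighbours."""
    score = sum(emissions[i][t] for i, t in enumerate(tags))
    score += sum(transitions[prev][cur] for prev, cur in zip(tags, tags[1:]))
    return score

def log_likelihood(emissions, transitions, gold_tags):
    """log p(y|X) = score(X, y) - log of the sum of exp(score) over all sequences."""
    num_tags = len(transitions)
    all_scores = [sequence_score(emissions, transitions, seq)
                  for seq in product(range(num_tags), repeat=len(emissions))]
    top = max(all_scores)  # subtract the maximum for numerical stability
    log_z = top + math.log(sum(math.exp(s - top) for s in all_scores))
    return sequence_score(emissions, transitions, gold_tags) - log_z

def task_loss(batch, transitions):
    """Negative log-likelihood summed over (emissions, gold_tags) pairs."""
    return -sum(log_likelihood(e, transitions, y) for e, y in batch)
```

Because the probabilities of all tag sequences sum to one, the loss is always non-negative and reaches zero only when the gold sequence receives all of the probability mass.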
The adversarial loss $L_{Adv}$ is defined over $\theta_s$ and $\theta_d$, the parameters of the shared Bi-LSTM encoder $S$ and the modality discriminator $D$, respectively, where $X^i_k$ is the representation of sentence $i$ in modality $k$. The joint loss of CogAlign combines $L_{Task}$ and $L_{Adv}$.
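For orientation, one common way to write such a minimax objective, following adversarial multi-task learning, is given below. It is an assumed form built from the symbols defined above rather than a reproduction of the exact equations; in particular, the weight $\lambda$ is a hypothetical hyperparameter, and $\lambda = 1$ gives a plain sum of the two losses.

```latex
\begin{align}
L_{Adv} &= \min_{\theta_s} \Big( \max_{\theta_d} \sum_{i=1}^{T} \sum_{k \in \{\mathrm{text},\, \mathrm{cog}\}} \log D\big(S(X^i_k)\big) \Big) \\
L &= L_{Task} + \lambda \, L_{Adv}
\end{align}
```

Under this reading, the discriminator maximizes the log-probability it assigns to the true modality of each input, while the shared encoder is updated in the opposite direction so that the two modalities become hard to tell apart.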

Inference
After training, the shared encoder has learned a unified semantic space for representations of both the cognitive and textual modalities. We believe that the shared space embeds knowledge from cognitive processing signals. For inference, we therefore only use the textual part and the shared encoder (the components in the red dashed box in Figure 1). The private encoder outputs textual-modality-only representations, while the shared encoder generates cognitively augmented representations. The two representations are concatenated and fed into the predictor layer of the target task. This means that we do not need cognitive processing signals for inference on the target task. It also means that we can pretrain CogAlign with cognitive processing signals and then transfer it to other datasets for the same target task where cognitive processing signals are not available.
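The inference path can be sketched as a per-token concatenation of two encoder outputs computed from the same textual input. The encoders below are toy stand-ins for the trained Bi-LSTMs and are purely hypothetical; only the wiring reflects the description above.

```python
def infer_features(word_vectors, private_text_encoder, shared_encoder):
    """Concatenate, per token, the private textual states with the shared-encoder states."""
    private_states = private_text_encoder(word_vectors)
    shared_states = shared_encoder(word_vectors)
    return [list(p) + list(s) for p, s in zip(private_states, shared_states)]

# Toy stand-ins (hypothetical): each maps N word vectors to N state vectors.
def toy_private(vectors):
    return [[sum(v)] for v in vectors]

def toy_shared(vectors):
    return [[max(v), min(v)] for v in vectors]

features = infer_features([[1.0, 4.0], [3.0, -2.0]], toy_private, toy_shared)
```

No cognitive input appears anywhere in this path: both encoders receive only the word vectors, and the concatenated features go to the task predictor.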

Experiments
We conducted experiments on three NLP tasks, namely NER, sentiment analysis and relation extraction with two types of cognitive processing signals (eye-tracking and EEG) to validate the effectiveness of the proposed CogAlign.

Dataset and Cognitive Processing Signals
We chose a dataset with multiple cognitive processing signals: the Zurich Cognitive Language Processing Corpus (ZuCo). This corpus contains simultaneous eye-tracking and EEG signals collected while 12 native English speakers read 1,100 English sentences. Word-level signals can be divided by the duration of each word. The dataset includes two reading paradigms: normal reading and task-specific reading, where subjects perform a specific task. In this work, we only used the data of normal reading, since this paradigm accords with natural human reading. The materials for the normal reading paradigm consist of two datasets: 400 movie reviews from the Stanford Sentiment Treebank (Socher et al., 2013) with manually annotated sentiment labels, including 123 neutral, 137 negative and 140 positive sentences; and 300 paragraphs about famous people from the Wikipedia relation extraction corpus (Culotta et al., 2006), labeled with 11 relationship types, such as award and education.

Table 1: The 17 eye-tracking features used in the NER task.
Group | Feature | Description
EARLY | first fixation duration (FFD) | the duration of word w that is first fixated
EARLY | first pass duration (FPD) | the sum of the fixations before eyes leave the word w
LATE | number of fixations (NFIX) | the number of times word w that is fixated
LATE | fixation probability (FP) | the probability that word w is fixated
LATE | mean fixation duration (MFD) | the average fixation durations for word w
LATE | total fixation duration (TFD) | the total duration of word w that is fixated
LATE | n re-fixations (NR) | the number of times word w that is fixated after the first fixation
LATE | re-read probability (RRP) | the probability of word w that is fixated more than once
CONTEXT | total regression-from duration (TRD) | the total duration of regressions from word w
CONTEXT | w-2 fixation probability (w-2 FP) | the fixation probability of the word w-2
CONTEXT | w-1 fixation probability (w-1 FP) | the fixation probability of the word w-1
CONTEXT | w+1 fixation probability (w+1 FP) | the fixation probability of the word w+1
CONTEXT | w+2 fixation probability (w+2 FP) | the fixation probability of the word w+2
CONTEXT | w-2 fixation duration (w-2 FD) | the fixation duration of the word w-2
CONTEXT | w-1 fixation duration (w-1 FD) | the fixation duration of the word w-1
CONTEXT | w+1 fixation duration (w+1 FD) | the fixation duration of the word w+1
CONTEXT | w+2 fixation duration (w+2 FD) | the fixation duration of the word w+2

We also tested our model on the NER task. For NER, the 700 sentences selected for the above two tasks are annotated with three types of entities: PERSON, ORGANIZATION, and LOCATION. All annotated datasets are publicly available (https://github.com/DS3Lab/zuco-nlp/). The cognitive processing signals and textual features used for each task in this work are the same as in Hollenstein et al. (2019a).
Eye-tracking Features. Eye-tracking signals record human gaze behavior while reading. The eye-tracking data of ZuCo are collected by an infrared video-based eye tracker, the EyeLink 1000 Plus, with a sampling rate of 500 Hz. For NER, we used 17 eye-tracking features that cover all stages of gaze behavior and the effect of context. According to the reading process, these features are divided into three groups: EARLY, the gaze behavior when a word is fixated for the first time; LATE, the gaze behavior over a word that is fixated many times; and CONTEXT, the eye-tracking features over neighboring words of the current word. The 17 eye-tracking features used in the NER task are shown in Table 1. In the other two tasks, we employed 5 gaze behaviors: the first fixation duration (FFD), the number of fixations (NFIX), the total fixation duration (TFD), the first pass duration (FPD), and the gaze duration (GD), which is the duration from the first time the eyes move to the current word until the eyes leave the word.
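To make some of these word-level features concrete, the sketch below derives FFD, NFIX, TFD, MFD and NR from the fixations recorded on a single word. The input format (a list of fixation durations in milliseconds, in temporal order, for one reader) and the choice to return zeros for a word that was never fixated are assumptions for illustration, not a description of the ZuCo preprocessing.

```python
def gaze_features(fixation_durations):
    """Compute simple gaze features for one word.

    fixation_durations: durations (ms) of all fixations on the word, in temporal order.
    """
    n = len(fixation_durations)
    total = sum(fixation_durations)
    return {
        "FFD": fixation_durations[0] if n else 0.0,  # first fixation duration
        "NFIX": n,                                   # number of fixations
        "TFD": total,                                # total fixation duration
        "MFD": total / n if n else 0.0,              # mean fixation duration
        "NR": max(n - 1, 0),                         # re-fixations after the first
    }

example = gaze_features([200, 150, 100])
```

For three fixations of 200, 150 and 100 ms, this gives FFD = 200, NFIX = 3, TFD = 450, MFD = 150.0 and NR = 2.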
EEG Features. EEG signals record the brain's electrical activity in the cerebral cortex through electrodes placed on the scalp of the subject. In the datasets we used, EEG signals are recorded by a 128-channel EEG Geodesic Hydrocel system (Electrical Geodesics, Eugene, Oregon) at a sampling rate of 500 Hz with a bandpass of 0.1 to 100 Hz. The original EEG recordings have 128 dimensions. Among them, 23 EEG signals are removed during preprocessing since they are not related to cognitive processing. After preprocessing, we obtained 105 EEG signals. The remaining EEG signals are divided into 8 frequency bands according to the frequency of the brain's electrical signals: theta1 (t1, 4-6 Hz), theta2 (t2, 6.5-8 Hz), alpha1 (a1, 8.5-10 Hz), alpha2 (a2, 10.5-13 Hz), beta1 (b1, 13.5-18 Hz), beta2 (b2, 18.5-30 Hz), gamma1 (g1, 30.5-40 Hz) and gamma2 (g2, 40-49.5 Hz). The frequency bands reflect different functions of the brain's cognitive processing. For NER, we used 8 EEG features obtained by averaging the 105 EEG signals within each frequency band. For the other two tasks, EEG features were obtained by averaging the 105 signals over all frequency bands. All EEG features used are obtained by averaging over all subjects and normalization.
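The NER feature construction described above (one average per frequency band, then averaging across subjects and normalizing) can be sketched as follows. The data layout (a dictionary from band name to per-electrode values) and the use of min-max normalization are assumptions, since the text only says that the features are normalized.

```python
BANDS = ["t1", "t2", "a1", "a2", "b1", "b2", "g1", "g2"]

def band_features(word_eeg):
    """word_eeg: dict band -> per-electrode values (105 per band in the setup above).

    Returns one averaged value per frequency band, i.e. 8 features.
    """
    return [sum(word_eeg[band]) / len(word_eeg[band]) for band in BANDS]

def average_over_subjects(per_subject):
    """Element-wise mean of the per-subject feature vectors for the same word."""
    n = len(per_subject)
    return [sum(values) / n for values in zip(*per_subject)]

def min_max_normalize(values):
    """Assumed normalization: rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]
```

For the other two tasks, the same helpers would be applied along the band axis instead, giving one value per electrode rather than one per band.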

Settings
We evaluated the three NLP tasks in terms of precision, recall and F1 in our experiments. Word embeddings for all NLP tasks were initialized with the publicly available pre-trained GloVe (Pennington et al., 2014) vectors of 300 dimensions. For NER, we used 30-dimensional randomly initialized character embeddings. We set the dimension of the LSTM hidden states to 50 for both the private Bi-LSTMs and the shared Bi-LSTM. We performed 10-fold cross validation for NER and sentiment analysis and 5-fold cross validation for relation extraction.

Table 2: Results of CogAlign and other methods on the three NLP tasks augmented with eye-tracking features (eye), EEG features (EEG), and both (eye+EEG). 'Base * ' denotes that the model does not use any cognitive processing signals. 'Base' is a neural model that consists of a textual private encoder and a textual predictor, and combines cognitive processing signals with word embeddings via direct concatenation, similar to previous works. 'Base+TA' is a neural model where direct concatenation in the base model is replaced by the text-aware attention mechanism. Significance is indicated with asterisks: * = p<0.01.

Baselines
We compared our model with previous state-of-the-art methods on the ZuCo dataset. The method of Hollenstein et al. (2019a) incorporates cognitive processing signals into the model via the direct concatenation mentioned above.

Results
Results of CogAlign on the three NLP tasks are shown in Table 2. From the table, we observe that:
• By simply concatenating word embeddings with cognitive processing signals, the Base model is better than the model that uses no cognitive processing signals, indicating that cognitive processing signals (either eye-tracking or EEG signals) can improve all three NLP tasks. Notably, the improvements gained from eye-tracking features are larger than those obtained from EEG signals, while the combination of both does not improve over using only one of them. We conjecture that this may be due to the low signal-to-noise ratio of EEG signals, which decreases further when the two signals are combined.
• Compared with the Base model, Base+TA achieves better results on all NLP tasks. The text-aware attention gains an absolute improvement of 0.88, 2.04 and 0.17 F1 on NER, sentiment analysis and relation extraction, respectively. With Base+TA, the best results for most tasks are obtained by the combination of eye-tracking and EEG signals. This suggests that the proposed text-aware attention may have alleviated the noise problem of cognitive processing signals.
• The proposed CogAlign achieves the highest F1 on all three tasks, with improvements of 0.48, 2.17 and 0.87 F1 over Base+TA on NER, sentiment analysis and relation extraction, respectively, which demonstrates the effectiveness of our proposed model. In addition, CogAlign with both cognitive processing signals obtains new state-of-the-art performance on all NLP tasks. This suggests that CogAlign is able to effectively augment neural models with cognitive processing signals.

Ablation Study
To take a closer look at the improvements contributed by each part of our model, we perform an ablation study on all three NLP tasks with the two cognitive processing signals. The ablation tests include: (1) w/o text-aware attention, removing the text-aware attention mechanism; (2) w/o cognitive loss, discarding the loss of the cognitive predictor whose inputs are cognitive processing signals; (3) w/o modality discriminator, removing the discriminator and training parameters with the task loss. Table 3 reports the ablation study results. The absence of the text-aware attention, cognitive loss or modality discriminator results in a significant drop in performance. This demonstrates that these components all contribute to the effective incorporation of cognitive processing signals into neural models of the three target tasks. CogAlign outperforms both (2) w/o cognitive loss and (3) w/o modality discriminator by a large margin, indicating that the cognitive features can significantly enhance neural models.

Table 3: Ablation study on the three NLP tasks. Significance is indicated with asterisks: * = p<0.01.

Furthermore, we visualize the distribution of hidden states learned by the shared Bi-LSTM to give a more intuitive demonstration of the effect of adversarial learning. Figure 2 shows that the modality discriminator with adversarial learning forces the shared Bi-LSTM encoder to align textual inputs to cognitive processing signals in the same space.

Text-aware Attention Analysis
In addition to denoising the cognitive processing signals, the text-aware attention mechanism also obtains the task-specific features. To have a clear view of the role that the text-aware attention mechanism plays in CogAlign, we randomly choose samples and visualize the average attention weights over each signal in Figure 3.
For eye-tracking, signals reflecting late syntactic processing, such as 'NFIX' (number of fixations) and 'TFD' (total fixation duration), play an important role in the three tasks. These results are consistent with findings in cognitive neuroscience, where researchers have shown that readers tend to gaze at nouns repeatedly (Furtner et al., 2009) (related to the eye-tracking signal NFIX, the number of fixations) and that there is a dependency relationship between regression features and sentence syntactic structures (Lopopolo et al., 2019). In other NLP tasks that incorporate eye-tracking features, such as multiword expression extraction (Rohanian et al., 2017), late gaze features have also proved to be more important than early gaze features. Moreover, from the additional eye-tracking features used in NER, we find that cognitive features from neighboring words, such as 'w-2 FP' (w-2 fixation probability) and 'w+1 FP' (w+1 fixation probability), are helpful for identifying entities.
Since a single EEG signal has no practical meaning, we only visualize the attention weights over the EEG signals used in the NER task. Attention to 't1' (theta1) and 'a2' (alpha2) is stronger than to the other signals, suggesting that low-frequency electrical activity in the brain is prominent when we recognize an entity.

Table 4: Results of CogAlign in transfer learning to other datasets without cognitive processing signals. 'baseline' is a model trained and tested with one encoder for textual inputs. 'baseline (+ZuCo text)' is the baseline trained with both ZuCo textual data and the target dataset (i.e., Wikigold or SST). 'baseline (two encoders)' is the same as CogAlign (the inference version), where cognitive processing signals are replaced by textual inputs.

Transfer Learning Analysis
Cognitively-inspired NLP is limited by the difficulty of collecting cognitive processing signals. We therefore investigate whether our model can transfer cognitive features to other datasets for the same task that have no cognitive processing signals. We enable transfer learning in CogAlign with a method similar to the alternating training approach (Luong et al., 2016), which optimizes each task for a fixed number of mini-batches before switching to the next task. In our case, we alternately feed CogAlign instances from the ZuCo dataset and instances from other datasets built for the same target task but without cognitive processing signals. Since CogAlign is a multi-task learning framework, model parameters can be updated either by data with cognitive processing signals or by data without such signals, with the task-specific loss used in both situations. Note that only textual inputs are fed into the trained CogAlign for inference.
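The alternating schedule can be sketched as a small generator that switches between the two data sources after a fixed number of mini-batches. The function name, the `batches_per_turn` parameter and the policy of continuing with the remaining source once one is exhausted are assumptions for illustration.

```python
def alternating_batches(zuco_batches, target_batches, batches_per_turn=1):
    """Yield (source, batch) pairs, switching source after `batches_per_turn` batches.

    'zuco' batches carry cognitive signals; 'target' batches are text-only.
    """
    iterators = {"zuco": iter(zuco_batches), "target": iter(target_batches)}
    active = ["zuco", "target"]
    while active:
        for name in list(active):
            for _ in range(batches_per_turn):
                try:
                    yield name, next(iterators[name])
                except StopIteration:
                    # This source is exhausted; keep going with the other one.
                    active.remove(name)
                    break

schedule = list(alternating_batches(["z1", "z2"], ["t1", "t2", "t3"]))
```

In a training loop, each yielded batch would trigger one parameter update with the task-specific loss, as described above; `schedule` here interleaves the two sources and ends with the leftover target batch.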
To evaluate the capacity of CogAlign to transfer cognitive features, we select benchmark datasets for NER and sentiment analysis: Wikigold (Balasuriya et al., 2009) and the Stanford Sentiment Treebank (Socher et al., 2013). Since no other datasets use the same set of relation types as the ZuCo dataset, we do not test the relation extraction task for transfer learning. To ensure that the same textual data are used for comparison, we add a new baseline model (baseline (+ZuCo text)) that is trained on the combination of the textual data in ZuCo and the benchmark dataset. Additionally, as CogAlign uses two encoders for inference (i.e., the textual encoder and the shared encoder), for a fair comparison we set up another baseline (baseline (two encoders)) that also uses two encoders fed with the same textual inputs. The experimental setup is the same as described above.
Results are shown in Table 4. We observe that CogAlign consistently outperforms the two baselines. This indicates that CogAlign is able to effectively transfer cognitive knowledge (either eye-tracking or EEG) from ZuCo to other datasets. The best performance is achieved by transferring both eye-tracking and EEG signals at the same time.

Conclusions
In this paper, we have presented CogAlign, a framework that can effectively fuse cognitive processing signals into neural models of various NLP tasks by learning to align the textual and cognitive modalities at both the word and sentence level. Experiments demonstrate that CogAlign achieves new state-of-the-art results on three NLP tasks on the ZuCo dataset. Analyses suggest that the text-aware attention in CogAlign can learn task-related cognitive processing signals through its attention weights, while the modality discriminator with adversarial learning forces CogAlign to learn cognitive and textual representations in a unified space. Further experiments show that CogAlign is able to transfer cognitive information from ZuCo to other datasets without cognitive processing signals.