Can Brain Signals Reveal Inner Alignment with Human Languages?

Brain signals, such as Electroencephalography (EEG), and human languages have been widely explored independently for many downstream tasks; however, the connection between them has not been well explored. In this study, we explore the relationship and dependency between EEG and language. To study this at the representation level, we introduced \textbf{MTAM}, a \textbf{M}ultimodal \textbf{T}ransformer \textbf{A}lignment \textbf{M}odel, to observe coordinated representations between the two modalities. We used various relationship alignment-seeking techniques, such as Canonical Correlation Analysis and Wasserstein Distance, as loss functions to transform features. On downstream applications, sentiment analysis and relation detection, we achieved new state-of-the-art results on two datasets, ZuCo and K-EmoCon. Our method achieved an F1-score improvement of 1.7% on K-EmoCon and 9.3% on ZuCo for sentiment analysis, and 7.4% on ZuCo for relation detection. In addition, we provide interpretations of the performance improvement: (1) feature distributions show the effectiveness of the alignment module for discovering and encoding the relationship between EEG and language; (2) alignment weights show the influence of different language semantics as well as EEG frequency features; (3) brain topographical maps provide an intuitive demonstration of the connectivity in the brain regions. Our code is available at \url{https://github.com/Jason-Qiu/EEG_Language_Alignment}.


Introduction
Brain activity is an important parameter in furthering our knowledge of how human language is represented and interpreted (Toneva et al., 2020; Williams and Wehbe, 2021; Reddy and Wehbe, 2021; Wehbe et al., 2020; Deniz et al., 2021). Researchers from domains such as linguistics, psychology, cognitive science, and computer science have made large efforts in using brain-recording technologies to analyze cognitive activity during language-related tasks and observed that these technologies added value in terms of understanding language (Stemmer and Connolly, 2012).
Basic linguistic rules seem to be effortlessly understood by humans, in contrast to machines. Recent advances in natural language processing (NLP) models (Vaswani et al., 2017) have enabled computers to maintain long and contextual information through self-attention mechanisms. This attention mechanism has been leveraged to create robust language models, but at the cost of tremendous amounts of data (Devlin et al., 2019; Liu et al., 2019b; Lewis et al., 2020; Brown et al., 2020; Yang et al., 2019). Although performance has significantly improved with modern NLP models, they are still seen as suboptimal compared to the human brain. In this study, we explore the relationship and dependencies of EEG and language. We apply EEG, a popularized routine in cognitive research, for its accessibility and practicality, along with language to discover connectivity.
Our contributions are summarized as follows:
• To the best of our knowledge, this is the first work to explore the fundamental relationship and connectivity between EEG and language through computational multimodal methods.
• We introduced MTAM, a Multimodal Transformer Alignment Model, that learns coordinated representations through hierarchical transformer encoders. The transformed representations showed tremendous performance improvements and state-of-the-art results in downstream applications, i.e., sentiment analysis and relation detection, on two datasets, ZuCo 1.0/2.0 and K-EmoCon.
• We carried out experiments with multiple alignment mechanisms, i.e., canonical correlation analysis and Wasserstein distance, and proved that relation-seeking loss functions are helpful in downstream tasks.
• We provided interpretations of the performance improvement by visualizing the original and transformed feature distributions, showing the effectiveness of the alignment module for discovering and encoding the relationship between EEG and language.
• Our findings on word-level and sentence-level EEG-language alignment showed the influence of different language semantics as well as EEG frequency features, which provided additional explanations.
• The brain topographical maps delivered an intuitive demonstration of the connectivity of EEG and language response in the brain regions, which provides a physiological basis for our discovery.

Related Work

Prior work has compared brain recordings with representations from pretrained language models (Devlin et al., 2019) and found that the relationships between these two modalities generalized across participants. Huang et al. (2020) leveraged CT images and text from electronic health records to classify pulmonary embolism cases and observed that the multimodal model with late fusion achieved the best performance. However, the relationship between language and EEG has not been explored before.

Multimodal Learning of EEG and Language
Foster et al. (2021) applied EEG signals to predict specific values of each dimension in a word vector through regression models. Wang and Ji (2021) used word-level EEG features to decode corresponding text tokens through an open-vocabulary, sequence-to-sequence framework. Hollenstein et al. (2021) focused on a multimodal approach by utilizing a combination of EEG, eye-tracking, and text data to improve NLP tasks, but did not explore the relationship between EEG and language. More related work can be found in Appendix E.

Overview of Model Architecture
The architecture of our model is shown in Fig. 1. The bi-encoder architecture is helpful in projecting embeddings into a vector space for methodical analysis (Liu et al., 2019a; Hollenstein et al., 2021; Choi et al., 2021). Thus, in our study, we adopt the bi-encoder approach to effectively reveal hidden relations between language and EEG. The MTAM, Multimodal Transformer Alignment Model, contains several modules. We use a dual-encoder architecture, where each view contains hierarchical transformer encoders. The inputs of the two encoders are EEG and language, respectively. For the EEG hierarchical encoders, each encoder shares the same architecture as the encoder module in Vaswani et al. (2017). In the current literature, researchers assume that the brain acts as an encoder for high-dimensional semantic representations (Wang and Ji, 2021; Gauthier and Ivanova, 2018; Correia et al., 2013). Based on this assumption, the EEG signals act as low-level embeddings. By feeding them into their respective hierarchical encoder, we extract transformed EEG embeddings as input for the cross alignment module. As for the language path, the language encoder is slightly different from the EEG encoder. We first process the text with a pretrained large language model (LLM) to extract text embeddings.
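As a rough sketch, the dual-encoder idea can be expressed as two parallel Transformer encoders over projected EEG and text features. The dimensions, layer counts, and head counts below are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class DualEncoder(nn.Module):
    """Minimal sketch of a dual-encoder over EEG and text features.

    All hyperparameters here (d_eeg, d_text, d_model, depth, heads)
    are placeholders for illustration, not the paper's settings.
    """
    def __init__(self, d_eeg=104, d_text=768, d_model=128, n_layers=2):
        super().__init__()
        # Project each modality into a shared model dimension.
        self.eeg_proj = nn.Linear(d_eeg, d_model)
        self.txt_proj = nn.Linear(d_text, d_model)
        make_layer = lambda: nn.TransformerEncoderLayer(
            d_model, nhead=4, batch_first=True)
        self.eeg_enc = nn.TransformerEncoder(make_layer(), num_layers=n_layers)
        self.txt_enc = nn.TransformerEncoder(make_layer(), num_layers=n_layers)

    def forward(self, x_eeg, x_text):
        v_e = self.eeg_enc(self.eeg_proj(x_eeg))   # transformed EEG embeddings
        v_t = self.txt_enc(self.txt_proj(x_text))  # transformed text embeddings
        return v_e, v_t

model = DualEncoder()
v_e, v_t = model(torch.randn(2, 5, 104), torch.randn(2, 7, 768))
```

The two output sequences would then be passed to a cross alignment module that computes the alignment losses.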

Experimental Results and Discussions
In this study, we evaluate our method on two downstream tasks, Sentiment Analysis (SA) and Relation Detection (RD), on two datasets: K-EmoCon (Park et al., 2020) and ZuCo 1.0/2.0 (Hollenstein et al., 2018, 2020b). Given a succession of word-level or sentence-level EEG features and their corresponding language, the Sentiment Analysis task aims to predict the sentiment label. For Relation Detection, the goal is to extract semantic relations between entities in a given text. More details about the tasks, data processing, and experimental settings can be found in Appendix C.
In Table 1, we show the comparison results on the ZuCo dataset for Sentiment Analysis and Relation Detection. Our method outperforms all baselines, and the multimodal approach outperforms unimodal approaches, which further demonstrates the importance of exploring the inner alignment between EEG and language. The results on the K-EmoCon dataset are listed in Appendix D.

Ablation Study
To further investigate the performance of different mechanisms in the CAM, we carried out ablation experiments on the ZuCo dataset; the results are shown in Table 6 in Appendix D.2. The combination of CCA and WD performed better than using only one mechanism for sentiment analysis and relation detection in all model settings. We also conducted experiments on word-level, sentence-level, and concat word-level inputs, with results also shown in Table 6. We observe that word-level EEG features paired with their respective words generally outperform sentence-level and concat word-level inputs in both tasks.

Analysis
To understand the alignment between language and EEG, we visualize the alignment weights of word-level EEG-language alignment on the ZuCo dataset. From the word-level alignment in Fig. 2 and 3, the beta2 and gamma1 waves are most active. This is consistent with the literature, which showed that gamma waves are active in detecting emotions (Li and Lu, 2009) and beta waves are involved in higher-order linguistic functions (e.g., discrimination of word categories). Hollenstein et al. (2021) found that beta and theta waves were most useful in terms of model performance in sentiment analysis. Kensinger (2009) explained that, generally, negative events are more likely to be remembered than positive events. Building off of Kensinger (2009), negative words can embed a more significant and long-lasting memory than positive words, and thus may have higher activation in the occipital and inferior parietal lobes.

Figure 4: Brain topologies.
We analyzed which EEG features refined the model's performance, since different neurocognitive factors during language processing are associated with brain oscillations at various frequencies. The beta and theta bands contributed the most positively, which is consistent with the theta band power being expected to rise with increased language processing activity and with the band's relation to semantic memory retrieval (Kosch et al., 2020; Hollenstein et al., 2021). The beta band's contribution can be best explained by the effect of emotional connotations of the text (Bastiaansen et al., 2005; Hollenstein et al., 2021).
In Fig. 4, we visualized the brain topologies with word-level EEG features for important and unimportant words from positive and negative sentences in the ZuCo dataset. We deemed a word important if its definition had a positive or negative connotation. 'Upscale' and 'lame' are important positive and negative words, respectively, while 'will' and 'someone' are unimportant positive and negative words, respectively. There are two areas of the brain that are heavily associated with language processing: Broca's area and Wernicke's area. Broca's area is assumed to be located in the left frontal lobe, and this region is concerned with the production of speech (Nasios et al., 2019). The left posterior superior temporal gyrus is typically assumed to be Wernicke's area, and this locale is involved with the comprehension of speech (Nasios et al., 2019).
Similar to Fig. 2 and 3, we can observe that the beta2, gamma1, and gamma2 frequency bands have the most powerful signals for all words. In Fig. 4, activity in Wernicke's area is seen most visibly in the beta2, gamma1, and gamma2 bands for the words 'Upscale' and 'Will'. For the word 'Upscale', we also saw activity around Broca's area in the alpha1, alpha2, beta1, beta2, theta1, and theta2 bands. An interesting observation is that for the negative words, 'Lame' and 'Someone', we see very low activation in Broca's and Wernicke's areas. Instead, we see most activity in the occipital lobes and slightly over the inferior parietal lobes. The occipital lobes are noted as the visual processing area of the brain and are associated with memory formation, face recognition, distance and depth interpretation, and visuospatial perception (Rehman and Khalili, 2019). The inferior parietal lobes are generally found to be key actors in visuospatial attention and semantic memory (Numssen et al., 2021).

Conclusion
In this study, we explore the relationship between EEG and language. We propose MTAM, a Multimodal Transformer Alignment Model, to observe coordinated representations between the two modalities and employ the transformed representations for downstream applications. Our method achieved state-of-the-art performance on sentiment analysis and relation detection tasks on two public datasets, ZuCo and K-EmoCon. Furthermore, we carried out a comprehensive study to analyze the connectivity and alignment between EEG and language. We observed that the transformed features show less randomness and sparsity. The word-level language-EEG alignment clearly demonstrated the importance of the explored connectivity. We also provided brain topologies as an intuitive understanding of the corresponding activity regions in the brain, which could build an empirical neuropsychological basis for understanding the relationship between EEG and language through computational models.

Limitations
Since we propose a new task of exploring the relationship between EEG and language, we believe there are several limitations that can be addressed in future work.
• The size of the datasets may not be large enough. Due to the difficulty and time consumption of collecting human-related data (in addition to privacy concerns), there are few publicly available datasets that pair EEG recordings with corresponding natural language. Compared to other mature tasks (e.g., image classification, object detection), datasets that combine EEG signals with other modalities are rare. In the future, we would like to collect more data of EEG signals with natural language to enhance innovation in this direction.
• The computational architecture, the MTAM model, is relatively straightforward. The dual-encoder architecture is one of the standard paradigms in multimodal learning, and since our target is to explore the connectivity and relationship between EEG and language, we used this straightforward paradigm. Our model's architecture may be less complex than models for other tasks, such as image-text pre-training. However, we purposely avoided complicating the model's structure given the size of the training data: when adding more layers of complexity, the model was more prone to overfitting.
• The literature lacks available published baselines. As shown in our paper, since the task is new, there are not enough published works that provide comparable baselines. We understand that such comparison is important, so we implemented several baselines ourselves, including MLP, Bi-LSTM, Transformer, and ResNet, to provide more convincing judgment and support future work in this area.

Ethics Statement
The goal of our study is to explore the connectivity between EEG and language, which involves human subjects' data and relates to cognition in the brain, so we would like to provide an ethics discussion. First, all the data used in our paper come from publicly available datasets: K-EmoCon and ZuCo. We did not conduct any human-involved experiments ourselves, and we do not implement any technologies on the human brain. The datasets can be found in Park et al. (2020) and Hollenstein et al. (2018, 2020b). We believe this study can empirically provide findings about the connection between natural language and the human brain. To the best of our knowledge, we do not foresee any harmful uses of this scientific study.

B.1 Problem Formulation

Let $X_e \in \mathbb{R}^{D_e}$ and $X_t \in \mathbb{R}^{D_t}$ be the two normalized input feature matrices for EEG and text, respectively, where $D_e$ and $D_t$ denote the dimensions of the feature matrices. To encode the two feature vectors, we feed them into their hierarchical transformer encoders: $V_e = E_e(X_e; W_e)$ and $V_t = E_t(X_t; W_t)$, where $E_e$ and $E_t$ denote the separate encoders, $V_e$ and $V_t$ the outputs (the transformed low-level features), and $W_e$ and $W_t$ the trainable weights for EEG and text, respectively. The outputs of these two encoders can be further expanded as

$V_e = \{v_e^1, \ldots, v_e^n\}$ and $V_t = \{v_t^1, \ldots, v_t^k\}$, where $n$ and $k$ denote the number of instances in a given output vector and $v_e^n$ and $v_t^k$ denote the instances themselves. The details of the Transformer encoders are introduced in the section below.

A Three Paradigms of EEG and Language Alignment

B.2 Transformer Encoders
The transformer is based on the attention mechanism and outperforms previous models in accuracy and performance. The original transformer model is composed of an encoder and a decoder. The encoder maps an input sequence into a latent representation, and the decoder uses that representation, along with other inputs, to generate a target sequence. Our model adopts only the encoder, since we aim at learning representations of features.
First, we feed the input into an embedding layer, which learns a vector representation. Then we inject positional information into the embeddings via:
$$PE_{(pos,2i)} = \sin\!\left(pos/10000^{2i/d_{model}}\right), \quad PE_{(pos,2i+1)} = \cos\!\left(pos/10000^{2i/d_{model}}\right) \quad (1)$$
The attention model contains two sub-modules, a multi-headed attention model and a fully connected network. The multi-headed attention computes the attention weights for the input and produces an output vector with encoded information on how each feature should attend to all other features in the sequence.
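Eq. (1) can be implemented directly; a minimal NumPy sketch (assuming an even model dimension):

```python
import numpy as np

def positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encodings as in Eq. (1); d_model must be even."""
    pos = np.arange(max_len)[:, None]          # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]       # (1, d_model/2)
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                # even dimensions get sine
    pe[:, 1::2] = np.cos(angle)                # odd dimensions get cosine
    return pe

pe = positional_encoding(32, 64)               # one row per position
```

The resulting matrix is simply added to the token (or EEG-feature) embeddings before the first encoder layer.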
There are residual connections around each of the two sub-layers, followed by layer normalization; the residual connection adds the multi-headed attention output vector to the original positional input embedding, which helps training by allowing gradients to flow through the network directly. Multi-headed attention applies a self-attention mechanism, where the input goes through three distinct fully connected layers to create the query, key, and value vectors. The output of the residual connection goes through layer normalization.
In our model, the attention module contains $N$ identical layers, and each layer contains two sub-layers: a multi-head self-attention module and a fully connected feed-forward network. A residual connection and layer normalization are added around each sub-layer, so the output of a sub-layer can be expressed as $\text{Output} = \text{LayerNorm}(x + \text{SubLayer}(x))$. Multi-head attention uses $h$ different linear transformations to project the query, key, and value ($Q$, $K$, and $V$, respectively), and finally concatenates the different attention results:
$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\,W^O, \quad \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V) \quad (2)$$
where the projections $W_i^Q$, $W_i^K$, $W_i^V$, and $W^O$ are parameter matrices, and the computation of attention adopts the scaled dot-product:
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
For the output, we use a 1D convolutional layer and a softmax layer to calculate the final output.
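The scaled dot-product attention above can be sketched in a few lines of NumPy:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)   # each query's weights sum to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 4, 8))       # (batch, queries, d_k)
K = rng.standard_normal((2, 6, 8))       # (batch, keys, d_k)
V = rng.standard_normal((2, 6, 8))
out, w = scaled_dot_product_attention(Q, K, V)
```

Multi-head attention simply runs $h$ such computations on linearly projected inputs and concatenates the results.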

B.3 Cross Alignment Module
As shown in Fig. 5, there are three paradigms of EEG and language alignment. For the word level, the EEG features are divided by each word, and the objective of the alignment is to find the connectivity of different frequencies with the corresponding word. For the concat-word level, the 8 frequency bands' EEG features are concatenated as a whole, and then concatenated again to match the corresponding sentence, so the alignment aims to find the relationships within the sentence. For the sentence level, the EEG features are averaged over the word-level EEG features. There is no word boundary, so the alignment module encodes the embeddings as a whole and explores the general representations. In the Cross Alignment Module (CAM), we introduce a new loss function in addition to the original cross-entropy loss. The new loss is based on Canonical Correlation Analysis (CCA) (Andrew et al., 2013) and Optimal Transport (Wasserstein Distance). As in Andrew et al. (2013), CCA aims to concurrently learn the parameters of two networks to maximize the correlation between them. Wasserstein Distance (WD), which originates from Optimal Transport (OT), has the ability to align embeddings from different domains to explore their relationship (Chen et al., 2020).
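The three input paradigms can be illustrated with toy word-level EEG features; the shapes below are illustrative placeholders, not the datasets' actual dimensions:

```python
import numpy as np

# Toy word-level EEG: a 5-word sentence, 8 frequency bands, 16 features per band.
n_words, n_bands, d = 5, 8, 16
word_eeg = np.random.randn(n_words, n_bands, d)

# Word level: one (n_bands, d) feature block per word,
# aligned with that word's embedding.
word_level = word_eeg

# Concat-word level: concatenate the 8 bands per word,
# then concatenate all words to match the sentence.
concat_word_level = word_eeg.reshape(n_words, n_bands * d).reshape(-1)

# Sentence level: average the word-level features over the sentence.
sentence_level = word_eeg.mean(axis=0)
```

Each paradigm therefore trades off granularity (word level) against context (sentence level), which matches the ablation comparing the three input types.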
Canonical Correlation Analysis (CCA) is a method for exploring the relationships between two multivariate sets of variables. It learns linear transformations of two vectors that maximize the correlation between them, and is used in many multimodal problems (Andrew et al., 2013; Qiu et al., 2018; Gong et al., 2013). In this work, we apply CCA to capture the cross-domain relationship. Let the low-level transformed EEG features be $V_e$ and the low-level language features be $V_t$. We assume $(V_e, V_t) \in \mathbb{R}^{n_1} \times \mathbb{R}^{n_2}$ has covariances $(\Sigma_{11}, \Sigma_{22})$ and cross-covariance $\Sigma_{12}$. CCA finds pairs of linear projections of the two views, $(w_1' V_e, w_2' V_t)$, that are maximally correlated:
$$(w_1^*, w_2^*) = \arg\max_{w_1, w_2} \text{corr}(w_1' V_e, w_2' V_t) = \arg\max_{w_1, w_2} \frac{w_1' \Sigma_{12} w_2}{\sqrt{w_1' \Sigma_{11} w_1 \; w_2' \Sigma_{22} w_2}}$$
In our study, we modified the structure of Andrew et al. (2013) while honoring its purpose by replacing the neural networks with Transformer encoders. $w_1^*$ and $w_2^*$ denote the high-level, transformed weights for the low-level EEG and text features, respectively.
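A minimal linear-CCA sketch, computed via whitening and SVD, follows the classical formulation above rather than the exact deep-CCA implementation of Andrew et al. (2013); the regularizer `reg` is an assumption for numerical stability:

```python
import numpy as np

def cca_top_correlations(Ve, Vt, k=2, reg=1e-4):
    """Top-k canonical correlations between views Ve (n x d1) and Vt (n x d2)."""
    Ve = Ve - Ve.mean(axis=0)
    Vt = Vt - Vt.mean(axis=0)
    n = Ve.shape[0]
    S11 = Ve.T @ Ve / (n - 1) + reg * np.eye(Ve.shape[1])   # Sigma_11
    S22 = Vt.T @ Vt / (n - 1) + reg * np.eye(Vt.shape[1])   # Sigma_22
    S12 = Ve.T @ Vt / (n - 1)                               # Sigma_12

    def inv_sqrt(S):
        # Symmetric inverse square root via eigendecomposition.
        w, U = np.linalg.eigh(S)
        return U @ np.diag(1.0 / np.sqrt(w)) @ U.T

    # Singular values of the whitened cross-covariance are the correlations.
    T = inv_sqrt(S11) @ S12 @ inv_sqrt(S22)
    return np.linalg.svd(T, compute_uv=False)[:k]

# A CCA-based training loss then minimizes the negative sum of correlations:
# loss_cca = -cca_top_correlations(Ve, Vt, k).sum()
```

For two views related by an exact linear map, the top canonical correlation approaches 1.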
Wasserstein Distance (WD) is introduced in Optimal Transport (OT), which is a natural type of divergence for registration problems, as it accounts for the underlying geometry of the space, and has been used for multimodal data matching and alignment tasks (Chen et al., 2020; Yuan et al., 2020; Lee et al., 2019; Demetci et al., 2020; Qiu et al., 2022; Zhu et al., 2022). In Euclidean settings, OT introduces the WD $\mathcal{W}(\mu, \nu)$, which measures the minimum effort required to "displace" points across measures $\mu$ and $\nu$, where $\mu$ and $\nu$ are values observed in the empirical distribution. In our setting, we compute the temporal pairwise Wasserstein Distance on EEG features and language features, i.e., $(\mu, \nu) = (V_e, V_t)$. For simplicity and without loss of generality, assume $\mu \in P(\mathbb{X})$ and $\nu \in P(\mathbb{Y})$ denote two discrete distributions, formulated as $\mu = \sum_{i=1}^{n} u_i \delta_{x_i}$ and $\nu = \sum_{j=1}^{m} v_j \delta_{y_j}$, with $\delta_x$ the Dirac function centered on $x$. $\Pi(\mu, \nu)$ denotes all the joint distributions $\gamma(x, y)$ with marginals $\mu(x)$ and $\nu(y)$. The weight vectors $u \in \Delta_n$ and $v \in \Delta_m$ belong to the $n$- and $m$-dimensional simplex, respectively. The WD between the two discrete distributions $\mu$ and $\nu$ is defined as:
$$\mathcal{W}(\mu, \nu) = \min_{T \in \Pi(u, v)} \sum_{i=1}^{n} \sum_{j=1}^{m} T_{ij} \, c(x_i, y_j)$$
where $\Pi(u, v) = \{T \in \mathbb{R}_+^{n \times m} \mid T\mathbf{1}_m = u,\; T^\top \mathbf{1}_n = v\}$, $\mathbf{1}_n$ denotes an $n$-dimensional all-one vector, and $c(x_i, y_j)$ is the cost function evaluating the distance between $x_i$ and $y_j$.
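The paper does not specify its OT solver or cost function, so as an illustration, a common entropy-regularized (Sinkhorn) approximation of the discrete problem above can be sketched as:

```python
import numpy as np

def sinkhorn_wd(u, v, C, eps=0.01, n_iter=500):
    """Entropy-regularized approximation of the discrete Wasserstein distance.

    u, v: weight vectors on the simplex; C: cost matrix c(x_i, y_j).
    Returns the approximate transport cost and the transport plan T,
    whose rows sum to u and columns (approximately) to v.
    """
    K = np.exp(-C / eps)            # Gibbs kernel
    a = np.ones_like(u)
    for _ in range(n_iter):         # alternating marginal projections
        b = v / (K.T @ a)
        a = u / (K @ b)
    T = a[:, None] * K * b[None, :]
    return (T * C).sum(), T

# Identical point sets under uniform weights: the distance is ~0.
x = np.array([0.0, 1.0, 2.0])
C = np.abs(x[:, None] - x[None, :])
u = v = np.ones(3) / 3
wd, T = sinkhorn_wd(u, v, C)
```

In practice the cost matrix would hold pairwise distances between the transformed EEG features and language features.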

Loss Objective
The loss objective for the CAM module can be formalized as $\text{Loss} = l_{CE} + \alpha_1 l_{CCA} + \alpha_2 l_{WD}$, where $\alpha_i \in \{0, 1\}$, $i \in \{1, 2\}$, controls the weights of the different parts of the alignment-based loss objective.
Sentiment Analysis (SA) Given a succession of word-level or sentence-level EEG features and their corresponding language, the task is to predict the sentiment label. The ZuCo 1.0 dataset consists of sentences from the Stanford Sentiment Treebank, which contains movie reviews and their corresponding sentiment labels (i.e., positive, neutral, negative) (Socher et al., 2013). The K-EmoCon dataset categorizes emotion annotations as valence, arousal, happy, sad, nervous, and angry. For each emotion, the participant labeled the extent of the given emotion felt following a Likert-scale paradigm. Arousal and valence are rated 1 to 5 (1: very low; 5: very high). Happy, sad, nervous, and angry emotions are rated 1 to 4, where 1 means very low and 4 means very high. The ratings are dominantly labeled as very low and neutral. Therefore, to combat class imbalance, we collapse the labels to binary and ternary settings.
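Collapsing the K-EmoCon Likert ratings could look like the following; the exact cut points are an assumption for illustration, as the paper does not state them:

```python
def collapse_likert(rating: int, scheme: str = "binary") -> int:
    """Collapse a 1-5 Likert rating into binary or ternary labels.

    The thresholds below are hypothetical: the paper collapses labels to
    combat class imbalance but does not specify the cut points used.
    """
    if scheme == "binary":
        return 0 if rating <= 3 else 1                       # low vs. high
    # ternary: low / neutral / high
    return 0 if rating <= 2 else (1 if rating == 3 else 2)
```

For example, `collapse_likert(2)` maps a "low" rating to class 0, while `collapse_likert(4, "ternary")` maps it to the "high" class.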
Relation Detection (RD) The goal of relation detection (also known as relation extraction or entity association) is to extract semantic relations between entities in a given text. For example, in the sentence "June Huh won the 2022 Fields Medal.", the relation AWARD connects the two entities "June Huh" and "Fields Medal". The ZuCo 1.0/2.0 datasets provide the ground-truth labels and texts for this task. We use texts from the Wikipedia relation extraction dataset (Culotta et al., 2006), which has 10 relation categories: award, control, education, employer, founder, job title, nationality, political affiliation, visited, and wife (Hollenstein et al., 2018, 2020b).
C.2 Datasets and Data Processing

K-EmoCon Dataset K-EmoCon (Park et al., 2020) is a multimodal dataset including videos, speech audio, accelerometer, and physiological signals during a naturalistic conversation. After the conversation, each participant watched a recording of themselves and annotated their own and their partner's emotions. Five external annotators were recruited to annotate both parties' emotions, six emotions in total (Arousal, Valence, Happy, Sad, Angry, Nervous). The NeuroSky MindWave headset captured EEG signals from the left prefrontal lobe (FP1) at a sampling rate of 125 Hz in 8 frequency bands, including delta (0.5-2.75 Hz) and theta. We used Google Cloud's Speech-to-Text API to transcribe the audio data into text.
ZuCo Dataset The ZuCo Dataset (Hollenstein et al., 2018, 2020b) is a corpus of EEG signals and eye-tracking data recorded during natural reading. The reading tasks fall into three categories: sentiment analysis, natural reading, and task-specific reading. For sentiment analysis, the participant was presented with 400 positive, neutral, and negative labeled sentences from the Stanford Sentiment Treebank (Socher et al., 2013). The EEG data used in this study can be categorized into sentence-level and word-level features. The sentence-level features are the word-level EEG features averaged over the entire sentence duration. The word-level EEG features cover the first fixation duration (FFD) of a specific word: when the participant's eyes met the word, the EEG signals were recorded. For both word- and sentence-level features, 8 frequency bands were recorded at a sampling frequency of 500 Hz, denoted as follows: theta1 (4-6 Hz), theta2 (6.5-8 Hz), alpha1 (8.5-10 Hz), alpha2, and higher bands.

C.3 Experimental Setup
The hierarchical transformer encoders follow the standard skeleton from Vaswani et al. (2017), but with reduced complexity. To avoid overfitting, we adopt an oversampling strategy for data augmentation (Hübschle-Schneider and Sanders, 2019), which ensures a balanced distribution of classes in each batch. The train/test/validation split is (80%, 10%, 10%), as in Hollenstein et al. (2021). The EEG features are extracted from the datasets in 8 frequency bands and normalized with a Z-score over each frequency band, following previous work (S. Yousif et al., 2020; Fdez et al., 2021; Du et al., 2022). To preserve relatability, the word and sentence embeddings are also normalized with Z-scores. We use pre-trained language models to generate text features (Devlin et al., 2019), where all texts are tokenized and embedded using the BERT-uncased-base model. Each sentence has an average length of 20 tokens, so we instantiate a max length of 32 with padding. In the word-level case, we use an average length of 4 tokens per word and establish a max length of 10 with padding. The token vectors from the last four hidden layers of the pre-trained model are extracted and averaged to obtain the final sentence or word embedding. These embeddings are used in the sentence-level and word-level settings. For the concat word-level setting, we simply concatenate the word embeddings for their respective sentence. All the experimental parameters are listed in Appendix C.4.
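The normalization and embedding-pooling steps can be sketched as follows, assuming the per-layer hidden states have already been extracted from the pretrained model:

```python
import numpy as np

def zscore(x, axis=0, eps=1e-8):
    """Z-score normalization over each feature column (e.g., frequency band)."""
    return (x - x.mean(axis=axis, keepdims=True)) / (x.std(axis=axis, keepdims=True) + eps)

def pool_last4_layers(hidden_states):
    """Average the token vectors from the last four hidden layers, then
    mean-pool over tokens to obtain a single sentence/word embedding.

    hidden_states: list of (seq_len, dim) arrays, one per layer.
    """
    last4 = np.stack(hidden_states[-4:], axis=0)   # (4, seq_len, dim)
    return last4.mean(axis=0).mean(axis=0)         # (dim,)

rng = np.random.default_rng(1)
x = rng.standard_normal((100, 8)) * 5 + 2          # toy EEG band features
z = zscore(x)
hs = [rng.standard_normal((12, 768)) for _ in range(13)]  # toy BERT states
emb = pool_last4_layers(hs)
```

The `768` and `13` above reflect BERT-base's hidden size and layer count (embedding layer plus 12 encoder layers); the toy data itself is random.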
In this section, we present implementation details for our multilayer perceptron (MLP), ResNet, and BiLSTM baseline models. Throughout all baseline results, we used a pre-trained BERT-uncased-base model to extract features for text. In the case of EEG features, we used the signals as is. Both text and EEG features were normalized with a Z-score before being input into the models. We also used the cross-entropy loss function for all baseline results. We configure the MLP with 6 hidden layers. At every step before the last output layer, we apply a rectified linear unit activation function and a dropout rate of 0.3. Starting from the input layer, we use hidden layer sizes of 256, 128, and 64 for our baseline results. Our 1D ResNet architecture has 34 layers (Hong et al., 2020). The BiLSTM

D.4 t-SNE Feature Projections
In order to interpret the performance improvement, we visualized the original feature distribution and the transformed feature distribution. As shown in Fig. 7, the transformed feature distribution forms better clusters than the original one. The features learned by CAM are more easily separable, showing the effectiveness of discovering and encoding the relationship between EEG and language. Figures 8, 9, and 10 show more t-SNE projection results on the Sentiment Analysis task.

D.5 Sentence-level Alignment
Figure 11 shows the negative and positive sentence-level alignment weights on the ZuCo dataset. In Figure 11, we can see that the alpha1, beta1, and gamma1 frequency bands show larger differences in response between negative and positive sentences.

D.6 Baseline Results
In this section, we provide baseline results that directly use either EEG, language, or their fusion as input for the downstream applications. The results are shown in Table 7 and Table 8.

Figure 1 :
Figure 1: The architecture of our model, where EEG and language features are coordinately explored by two encoders. The EEG encoder and language encoder are shown on the left and right, respectively. The cross-alignment module is used to explore the connectivity and relationship within the two domains, while the transformed features are used for downstream tasks.

Fig. 2
and Fig. 3 show examples of negative and positive sentence word-level alignment, respectively. The sentence-level alignment visualizations are shown in Appendix D.5.

Figure 5 :
Figure 5: Three paradigms of EEG and language alignment.

Figure 7 :
Figure 7: t-SNE projection comparison of untransformed and transformed features of the ZuCo dataset, where different colors represent different classes.

Figure 8 :
Figure 8: Transformed feature projections of K-EmoCon dataset on Sentiment Analysis, where different colors represent different classes.

Figure 9 :
Figure 9: Transformed feature projections of ZuCo dataset on Sentiment Analysis, word-level, where different colors represent different classes.

Figure 10 :
Figure 10: Transformed feature projections of ZuCo dataset on Sentiment Analysis, concat word-level, where different colors represent different classes.

Figure 11 :
Figure 11: Negative and Positive sentence-level alignment of ZuCo dataset.

Table 1 :
Comparison with baselines on the ZuCo dataset for Sentiment Analysis (SA) and Relation Detection (RD).
connectivity-based loss function. In our study, we investigate two alignment methods, i.e., Canonical Correlation Analysis (CCA) and Wasserstein Distance (WD). The output features from the cross alignment module can be used for downstream applications. The details of each part are introduced in Appendix B.3.

Table 2 :
Experiment parameters used in the paper, where the best ones are marked in bold.