Improving Federated Learning for Aspect-based Sentiment Analysis via Topic Memories

Aspect-based sentiment analysis (ABSA) predicts the sentiment polarity towards a particular aspect term in a sentence and is an important task in real-world applications. To perform ABSA, a trained model must understand the contextual information well, especially the particular patterns that suggest the sentiment polarity. However, these patterns typically vary across sentences, especially when the sentences come from different sources (domains), which keeps ABSA challenging. Although combining labeled data across different sources (domains) is a promising way to address this challenge, in practical applications such labeled data are usually stored at different locations and may be inaccessible to each other due to privacy or legal concerns (e.g., the data are owned by different companies). To address this issue and make the best use of all labeled data, we propose a novel ABSA model that adopts federated learning (FL) to overcome the data isolation limitation and incorporates a topic memory (TM) to account for data from diverse sources (domains). In particular, because data inaccessibility makes it hard to tell the isolated data sources apart, TM provides useful categorical information for localized predictions. Experimental results in a simulated FL environment with three nodes demonstrate the effectiveness of our approach, where TM-FL outperforms different baselines, including some well-designed FL frameworks.


Introduction
Aspect-based sentiment analysis (ABSA) is one of the most popular natural language processing (NLP) tasks; it aims to predict the sentiment polarity (i.e., "positive", "negative", and "neutral") for an aspect term in a sentence. Currently, methods based on deep learning have been widely used for ABSA and have demonstrated excellent potential (Zadeh et al., 2017; Xue and Li, 2018; Chaturvedi et al., 2018; Xu et al., 2019b). However, these methods still hit a bottleneck when there is not enough labeled training data. One feasible solution is to leverage extra labeled data from other sources or domains. However, in real applications, these data are usually stored in different locations (nodes) and are inaccessible to each other owing to privacy or legal concerns. The code involved in this paper is released at https://github.com/cuhksz-nlp/ASA-TM.
To address the data isolation issue, federated learning (FL) (Shokri and Shmatikov, 2015; Konečný et al., 2016a,b) was proposed and has shown great promise for many machine learning tasks, such as user-computer interaction (Aono et al., 2017), medical image analysis (Sheller et al., 2018), and financial data analysis (Yang et al., 2019a; He et al., 2020). In some cases, data in different nodes are encrypted and aggregated by the centralized model, and they are invisible to each other during the training stage (Hard et al., 2018). This property makes FL an essential technique for real applications with privacy and security requirements.
Recently, FL has been applied to many downstream NLP applications (Zhu et al., 2020) such as mobile keyboard prediction (Hard et al., 2018), language model training, representation learning, spoken language understanding, medical relation extraction (Sui et al., 2020), medical named entity recognition (Ge et al., 2020), and news recommendation (Qi et al., 2020). However, conventional FL techniques are more suitable for nodes sharing homogeneous data, which is seldom the case for NLP tasks because text data are usually heterogeneous in vocabulary and expression patterns. ABSA in particular is sensitive to domain information: one particular token may suggest completely different sentiment polarities in different datasets. Therefore, the restricted data access in traditional federated learning approaches could result in inferior performance for ABSA, since they cannot update the model using all domain information. Unfortunately, limited attention has been paid to this issue. Most existing approaches to FL for NLP (e.g., for language modeling (Hard et al., 2018; Chen et al., 2019), named entity recognition (Ge et al., 2020), and text classification (Zhu et al., 2020)) mainly focus on optimizing the learning process and ignore domain diversity.
In this paper, we propose a neural model based on FL for ABSA in a distributed environment, namely TM-FL, with a topic memory that enhances FL by providing categorical (topic) information for localized predictions, which addresses the difficulty of identifying text sources caused by data inaccessibility. Specifically, the topic model serves as a server-side component that reads different inputs from each node and responds with categorical weights to help the backbone ABSA classifier. Compared with previous ABSA studies that leverage extra features, e.g., document information (Li et al., 2018a), commonsense knowledge (Ma et al., 2018), and word dependencies (Tang et al., 2020), our approach offers an alternative that improves ABSA by leveraging extra labeled data through the FL framework enhanced by TM. Experimental results on a simulated environment with isolated data from laptop reviews, restaurant reviews, and social media (i.e., tweets) demonstrate the effectiveness of our approach, where TM-FL outperforms different baselines, including ones with a well-designed FL framework.

Federated Learning
Federated learning (FL) was first proposed by Google and then further developed by many studies over the past years (Shokri and Shmatikov, 2015; Konečný et al., 2016a,b; McMahan et al., 2017). FL aims to build machine-learning models based on datasets distributed across multiple devices while preventing data leakage. Generally, in federated learning, the data is stored locally in different nodes and is never uploaded to the server or exchanged with other nodes. Thus, the centralized model on the server side cannot directly exploit the data to optimize its parameters. Instead, each node computes a local model update based on its own data, and the local updates from all nodes are then aggregated by the centralized model to optimize the parameters. Since such local model updates cannot be directly translated back to the original data, data privacy and security are significantly enhanced. There are also other ways to apply FL, such as sending transformed or encrypted data that cannot be converted back to the original data (Hard et al., 2018). FL has been applied to many areas (Yang et al., 2019b; Liu et al., 2020; Wang et al., 2020b; Zheng et al., 2020) and recently, many studies have focused on optimizing the learning process (Konečný et al., 2016b; Zheng et al., 2020; Wang et al., 2020b).
Particularly, the FEDERATEDAVERAGING algorithm, proposed by McMahan et al. (2017), combines node updates to produce a new global model. At the beginning of each training round, the global model is sent to a subset of nodes. Each of the selected nodes then randomly samples a subset of its local dataset to train the model locally. In the training process, the nodes compute the average gradient on their local datasets with the current global model. The server collects the gradients and aggregates them to update the global model. This process repeats until the global model converges.
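The round described above can be illustrated with a toy example. The following is a minimal sketch, not the implementation of McMahan et al. (2017): it assumes a linear model with squared-error loss, weights each node's gradient by the number of examples it sampled, and uses hypothetical names (`average_gradient`, `federated_round`, `node_fraction`, `sample_fraction`) chosen only for illustration.

```python
import random


def average_gradient(weights, batch):
    """Mean gradient of the squared error 0.5 * (w.x - y)^2 over a batch."""
    grad = [0.0] * len(weights)
    for x, y in batch:
        err = sum(w * xi for w, xi in zip(weights, x)) - y
        for j, xi in enumerate(x):
            grad[j] += err * xi / len(batch)
    return grad


def federated_round(weights, node_datasets, lr=0.1, node_fraction=1.0,
                    sample_fraction=1.0, rng=random):
    """One round: select nodes, collect their average gradients, update."""
    n_nodes = max(1, int(node_fraction * len(node_datasets)))
    selected = rng.sample(node_datasets, n_nodes)
    grads, sizes = [], []
    for data in selected:
        # Each selected node samples a subset of its local data.
        k = max(1, int(sample_fraction * len(data)))
        batch = rng.sample(data, k)
        grads.append(average_gradient(weights, batch))
        sizes.append(k)
    total = sum(sizes)
    # Server side: weight each node's gradient by the examples it used
    # (an assumption of this sketch), then take one gradient step.
    aggregated = [sum(g[j] * s for g, s in zip(grads, sizes)) / total
                  for j in range(len(weights))]
    return [w - lr * g for w, g in zip(weights, aggregated)]


# Toy usage: two nodes holding samples of y = 2x; the raw data never
# leaves the node-side function, only gradients are aggregated.
nodes = [[([1.0], 2.0), ([2.0], 4.0)], [([3.0], 6.0)]]
w = [0.0]
for _ in range(200):
    w = federated_round(w, nodes)
```

In this toy setting the shared weight approaches 2.0 even though no node ever sees another node's examples.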

Aspect-based Sentiment Analysis
Aspect-based sentiment analysis (ABSA) is a longstanding NLP task of detecting the sentiment polarity towards a given aspect term in a sentence. Many recent studies have applied neural network approaches to ABSA (Ma et al., 2017; Fan et al., 2018; Gu et al., 2018; He et al., 2018b; Huang and Carley, 2018; Li et al., 2018b; Chen and Qian, 2019; Hu et al., 2019; Du et al., 2019). Usually, external knowledge is incorporated to obtain a better understanding of contextual information so as to enhance model performance on downstream NLP tasks, including ABSA (Li et al., 2018a; Ma et al., 2018; Chen et al., 2020b; Tang et al., 2020; Chen et al., 2021; Tian et al., 2021b,c,d). However, most previous studies assume an ideal environment where all the data is accessible and visible for the experiments, which is rarely the case in real applications. In this paper, we propose an alternative that handles the data isolation problem and improves ABSA under the constraints of FL by leveraging extra labeled data from different domains through a topic memory module.

The Proposed Method
We propose TM-FL for ABSA, and the overall server-node architecture of our approach is illustrated in Figure 2. The centralized model is stored in the FL server, and data from multiple sources (domains) are stored at different nodes (the $i$-th node is denoted by $N_i$). Encrypted information (e.g., data, vectors, and loss) is communicated between each node $N_i$ and the FL server. In this way, the original data stays in the local node and is not accessible to the other nodes. To encode categorical information that facilitates localized prediction, we incorporate TM into the centralized model (Figure 1). Herein, FL encodes the topic information from the encrypted input and uses the encoded information to guide the centralized model to make a localized prediction. In the following text, we introduce FL for ABSA and then the centralized model with TM.

Federated Learning
In federated learning, the data is stored in different local nodes and never exchanged with other nodes. Thus, the centralized model cannot directly access these data; instead, it aggregates the encrypted information generated by each local node to complete an update in every training round. There are different ways to apply federated learning, such as having the clients send the losses and gradients with respect to the local data back to the FL server (Konečný et al., 2016a), or having the clients send encrypted information about the local data, which cannot be deciphered, back to the FL server (Hard et al., 2018). In this paper, we follow the paradigm of Hard et al. (2018), where encrypted information, including hidden vectors and loss, is transferred between the server and clients. In addition, following previous work, we adopt a modified version of the FEDERATEDAVERAGING algorithm in which no models are sent to clients.
In the training process of FL, the node $N_i$ first encrypts the original input sentence with $n$ words and its aspect term, yielding the encrypted sentence $X_i$ and the encrypted aspect term $A_i$. Next, $X_i$ and $A_i$ are sent to the server and fed into the centralized model. Then, the model processes the encrypted input and computes the score vector $o_i$ over all sentiment polarities, where each dimension of $o_i$ corresponds to a particular sentiment polarity among positive, negative, and neutral. Afterward, $o_i$ is passed back to $N_i$ and decoded into the model prediction $y_i$. After that, we apply the negative log-likelihood loss function to the sentiment polarity prediction and compute the loss for node $N_i$ (i.e., $\mathcal{L}_i$) by
$$\mathcal{L}_i = -\log p(y^{(i)*} \mid X_i, A_i)$$
where $p(y^{(i)*} \mid X_i, A_i)$ denotes the predicted probability of the ground-truth sentiment polarity $y^{(i)*}$ for the given aspect term $A_i$ in $X_i$. Finally, $\mathcal{L}_i$ is passed to the server, and backpropagation is applied to update the parameters of the centralized model accordingly.
The nodes will host no model but only encrypt the local data and send it to the server. In the following texts, we first describe how we construct a topic memory network and use it to capture domain-specific information. Then we explain how we apply our approach to ABSA.
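The node-side steps of the training round above (decoding the returned score vector and computing the negative log-likelihood loss) can be sketched as follows. This is a minimal illustration under stated assumptions: the score vector is turned into probabilities with a softmax, decoding is assumed to pick the highest-scoring polarity, and the function names and label order are hypothetical.

```python
import math

# Assumed label order for the three dimensions of the score vector o_i.
POLARITIES = ["positive", "negative", "neutral"]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def decode_prediction(scores):
    """Node side: map o_i to a polarity label (argmax is an assumption)."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return POLARITIES[best]


def nll_loss(scores, gold):
    """L_i = -log p(y* | X_i, A_i), with p assumed to be softmax(o_i)."""
    probs = softmax(scores)
    return -math.log(probs[POLARITIES.index(gold)])


# Toy usage: the server returns o_i, the node decodes it and computes L_i,
# which would then be sent back to the server for backpropagation.
o_i = [2.0, 0.5, 0.1]
prediction = decode_prediction(o_i)
loss = nll_loss(o_i, "positive")
```

Only the scores and the scalar loss cross the node-server boundary in this sketch; the gold label stays on the node.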

Centralized Model with Topic Memory
Standard FL cannot utilize the categorical information from the isolated data and thus cannot achieve optimal results for localized prediction. This is a critical barrier for the ABSA task, where data from different sources usually contain heterogeneous vocabularies and expression patterns. In this work, we propose to leverage TM to explore the topic information in the data and use it to guide the centralized model in making localized predictions. As for the input sentence, previous studies concatenate the aspect term(s) directly to the end of an input sentence, with a special token serving as the separator (if the encoder is BERT, the special token is [SEP]), and feed the resulting sentence+aspect pair into an encoder (Zeng et al., 2019; Phan and Ogunbona, 2020; Veyseh et al., 2020; Chen et al., 2020a). This straightforward method has proved effective for ABSA. Following this paradigm, in the centralized model, we concatenate the encrypted $X_i$ and $A_i$ into a new sequence $X^E_i$ with a special token inserted between them, and encode it into the hidden vector
$$h^X_i = \mathrm{SE}(X^E_i)$$
where SE is the encoder for encoding the encrypted information. Based on $X_i$ and $h^X_i$, TM generates the topic vector (denoted as $u_i \in \mathbb{R}^{d_h}$) through the following process. First, we use a matrix $W_\phi \in \mathbb{R}^{d_v \times d_t}$ ($d_v$ and $d_t$ denote the vocabulary size and the topic size, respectively) to represent the topic model that provides the categorical information, where $W_\phi$ comes from a pre-trained neural topic model. Each row of $W_\phi$ can be regarded as a word embedding for a particular word, with each dimension of the embedding corresponding to the value for a specific topic. Similarly, each column of $W_\phi$ can be regarded as a topic embedding for a particular topic.
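The input construction and the two readings of $W_\phi$ described above can be sketched in a few lines. The helper names and the toy matrix below are hypothetical and serve only to illustrate the row/column interpretation; plain tokens stand in for the encrypted input.

```python
def build_pair(sentence_tokens, aspect_tokens, sep="[SEP]"):
    """Concatenate sentence and aspect term with a separator token between them."""
    return list(sentence_tokens) + [sep] + list(aspect_tokens)


def word_embedding(W_phi, word_id):
    """Row of W_phi: one value per topic for the given word (length d_t)."""
    return list(W_phi[word_id])


def topic_embedding(W_phi, topic_id):
    """Column of W_phi: one value per vocabulary word for the given topic (length d_v)."""
    return [row[topic_id] for row in W_phi]


# Toy matrix with d_v = 3 words and d_t = 2 topics (illustrative values only).
W_phi = [[0.1, 0.9],
         [0.8, 0.2],
         [0.5, 0.5]]
pair = build_pair(["the", "screen", "is", "great"], ["screen"])
e_word = word_embedding(W_phi, 1)    # [0.8, 0.2]
e_topic = topic_embedding(W_phi, 0)  # [0.1, 0.8, 0.5]
```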
Table 1: The number of aspect terms with "positive" (Pos.), "neutral" (Neu.), and "negative" (Neg.) sentiment polarities in the train/test sets of all three datasets.
Next, we use $W_\phi$ to map all words in $X_i$ to the corresponding word embeddings (the embedding for the $j$-th word in $X_i$ is denoted as $e^x_{i,j} \in \mathbb{R}^{d_t}$), and map all topics to the corresponding topic embeddings (the embedding for the $k$-th topic is denoted as $e^t_k \in \mathbb{R}^{d_v}$). Then, we apply average pooling to the word embeddings over $X_i$. We feed $e^s_{i,k}$ into a multi-layer perceptron (MLP) to compute the source memory $s_{i,k}$. Afterward, we compute the attention weight $p_{i,k}$ for the $k$-th topic. Finally, $p_{i,k}$ is applied to the target memories to obtain the topic vector $u_i$, where $t_k \in \mathbb{R}^{d_h}$ denotes the target memory for the $k$-th topic. We perform element-wise addition on $h^X_i$ and $u_i$, and pass the resulting vector to a fully connected layer to obtain $o_i$, which can be formalized by
$$o_i = W \cdot (h^X_i + u_i) + b$$
where $W$ and $b$ are the trainable matrix and bias vector, respectively, in the fully connected layer.
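The pooling, read-out, and fusion steps can be sketched as below. This is an illustration under explicit assumptions rather than the authors' formulation: the attention weights are taken as given inputs, the read-out is assumed to be a weighted sum of the target memories, and all function names and numbers are hypothetical.

```python
def mean_pool(vectors):
    """Average pooling over a list of equally sized vectors."""
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]


def topic_vector(attention, target_memories):
    """u = sum_k p_k * t_k (the weighted sum is an assumption of this sketch)."""
    d_h = len(target_memories[0])
    return [sum(p * t[j] for p, t in zip(attention, target_memories))
            for j in range(d_h)]


def fuse_and_score(h, u, W, b):
    """o = W (h + u) + b: element-wise addition, then a fully connected layer."""
    fused = [hj + uj for hj, uj in zip(h, u)]
    return [sum(wj * fj for wj, fj in zip(row, fused)) + bi
            for row, bi in zip(W, b)]


# Toy usage with d_h = 2, two topics, and three polarity scores.
pooled = mean_pool([[1.0, 0.0], [1.0, 1.0]])                # [1.0, 0.5]
u = topic_vector([0.25, 0.75], [[1.0, 0.0], [0.0, 1.0]])    # [0.25, 0.75]
o = fuse_and_score([1.0, 1.0], u,
                   [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
                   [0.0, 0.0, 0.5])                         # [1.25, 1.75, 3.5]
```

Because the topic vector is added to the sentence representation before the final layer, the same hidden vector can lead to different polarity scores depending on which topics receive the most weight.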

Datasets
To test the proposed approach, we follow the convention of recent FL-based NLP studies (Zhu et al., 2020; Sui et al., 2020) to build a simulated environment where isolated data are stored in three nodes. Each node contains one of the three widely used English benchmark datasets (i.e., LAP14, REST14 (Pontiki et al., 2014), and TWITTER (Dong et al., 2014)) for ABSA, so that each node holds all the data from one domain. Particularly, LAP14 contains laptop computer reviews; REST14 consists of online reviews of restaurants; TWITTER includes tweets collected through the Twitter API. For LAP14 and REST14, following previous studies (Tang et al., 2016b; He et al., 2018a), the aspect terms with "conflict" sentiment polarity and the sentences without an aspect term are removed. For all datasets, we use their official train/test splits and randomly pick 10% of the training set as the development set to find the best hyper-parameters, which are then applied to our models. The statistics of the three datasets (i.e., the numbers of aspect terms with "positive", "negative", and "neutral" sentiment polarities) are reported in Table 1.
To further improve the model performance by leveraging extra labeled data from different domains, we train a neural topic model and then obtain the topic-vocab matrix to initialize $W_\phi$. We train our neural topic model on five online datasets: (1) the Yelp dataset, (2) the IMDb dataset, (3) the Amazon dataset, (4) the SemEval-2017 Task 4 (SemEval2017) dataset, and (5) the MitchellAI-13-Opensentiment dataset (Mitchell et al., 2013). Particularly, Yelp contains online reviews of restaurants and hotels; IMDb contains reviews of movies; Amazon includes comments on goods; SemEval2017 and MitchellAI-13-Opensentiment contain tweets. We randomly sample 75K sentences from each domain (i.e., reviews of restaurants and hotels, reviews of movies, comments on goods, and tweets) and put them together to form the combined training data with roughly 300K sentences for the topic model.
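The corpus construction above (an equal-sized random sample from each of the four domains, then pooling) can be sketched as follows. The function name, the fixed seed, and the dictionary layout are assumptions made for illustration; the toy call uses tiny placeholder lists instead of the real datasets.

```python
import random


def build_topic_corpus(domain_sentences, per_domain=75_000, seed=0):
    """Sample `per_domain` sentences from every domain and pool them."""
    rng = random.Random(seed)  # fixed seed is an assumption, for reproducibility
    corpus = []
    for name in sorted(domain_sentences):
        corpus.extend(rng.sample(domain_sentences[name], per_domain))
    rng.shuffle(corpus)
    return corpus


# Toy usage: four domains with ten placeholder sentences each, three per domain.
toy = {
    "restaurants_hotels": [f"rh-{i}" for i in range(10)],
    "movies": [f"mv-{i}" for i in range(10)],
    "goods": [f"gd-{i}" for i in range(10)],
    "tweets": [f"tw-{i}" for i in range(10)],
}
corpus = build_topic_corpus(toy, per_domain=3)  # 4 domains x 3 = 12 sentences
```

With the default of 75,000 sentences per domain and four domains, the pooled corpus has 300,000 sentences, matching the figure stated above.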

Neural Topic Model
Inspired by Miao et al. (2017), we train a neural topic model based on the variational auto-encoder (Kingma and Welling, 2013) to extract the latent topic distribution $z$ with prior parameters $\mu$ and $\sigma$ of the datasets, where the overall structure of the topic model is illustrated in Figure 3. Specifically, given an input sentence $\mathcal{X} = x_1 x_2 x_3 \cdots x_n$, we first obtain the one-hot representation $X_{bow}$ of $\mathcal{X}$ and then pass it to a multi-layer perceptron (MLP) to get the hidden representation $h^t_X$ of the input sentence, formalized by
$$X_{bow} = \text{one-hot}(\mathcal{X})$$
and
$$h^t_X = \mathrm{MLP}(X_{bow})$$
where $h^t_X \in \mathbb{R}^{d_h}$. Next, the prior parameters $\mu$ and $\sigma$ of the latent topic distribution $z$ are estimated and defined as
$$\mu = \mathrm{MLP}_\mu(h^t_X)$$
and
$$\sigma = \mathrm{MLP}_\sigma(h^t_X)$$
where $\mathrm{MLP}_\mu$ and $\mathrm{MLP}_\sigma$ refer to two different multi-layer perceptrons. Then, we randomly sample $\theta$ from $z$ to be the latent topic representation of the input sentence $\mathcal{X}$. Afterward, we generate the output vector $o^t$ from $\theta$ with a trainable matrix $W_\phi \in \mathbb{R}^{d_v \times d_t}$ and a trainable bias vector $b_\phi \in \mathbb{R}^{d_v}$, where $o^t \in \mathbb{R}^{n \times d_v}$ refers to the predicted probability over the vocabulary for each position in the original input sentence. In practice, we train the topic model in an unsupervised manner and then extract the topic-vocab matrix to initialize $W_\phi$. For sampling $\theta$, we sample another random variable $\epsilon \sim \mathcal{N}(0, 1)$ and then parameterize $\theta$ by $\theta = \mu + \epsilon \cdot \sigma$.
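The sampling step $\theta = \mu + \epsilon \cdot \sigma$ and the decoding step can be illustrated with a small sketch. It makes several assumptions that go beyond the text: $X_{bow}$ is built as a count vector (the sum of one-hot vectors), $\epsilon$ is drawn independently per dimension, and the decoder applies a softmax to $W_\phi \theta + b_\phi$ to obtain a distribution over the vocabulary. All function names are hypothetical.

```python
import math
import random


def bag_of_words(token_ids, vocab_size):
    """Sum of one-hot vectors: a count per vocabulary entry (assumed form of X_bow)."""
    bow = [0.0] * vocab_size
    for t in token_ids:
        bow[t] += 1.0
    return bow


def sample_theta(mu, sigma, rng=random):
    """Reparameterization: theta = mu + eps * sigma, with eps ~ N(0, 1) per dimension."""
    return [m + rng.gauss(0.0, 1.0) * s for m, s in zip(mu, sigma)]


def decode_word_probs(W_phi, b_phi, theta):
    """Assumed decoder: softmax(W_phi . theta + b_phi), one probability per word."""
    logits = [sum(w * t for w, t in zip(row, theta)) + b
              for row, b in zip(W_phi, b_phi)]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy usage with d_v = 3 words and d_t = 2 topics.
x_bow = bag_of_words([0, 2, 2], vocab_size=3)  # [1.0, 0.0, 2.0]
theta = sample_theta([0.5, -1.0], [0.1, 0.2])
probs = decode_word_probs([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]],
                          [0.0, 0.0, 0.0], theta)
```

Writing the sample as a deterministic function of $\mu$, $\sigma$, and an independent noise term is what allows gradients to flow back to the two MLPs during training.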

Implementation
In the experiments, we run the baselines without federated learning (i.e., BT-b and BT-l) on each single dataset (i.e., LAP14, REST14, or TWITTER) and on the combined dataset consisting of all three datasets, denoted as the union dataset. However, it is rarely practical to train a model on the union dataset in real applications (since the data are isolated in different nodes). Therefore, the experimental results on the union dataset reveal a possible upper bound for FL-based models and are mainly used for reference. For the FL baselines (i.e., FL) and our proposed approach (i.e., TM-FL), we run them in the simulated environment where the LAP14, REST14, and TWITTER datasets are isolated in three nodes. Specifically, the first node holds LAP14; the second node holds REST14; the third node holds TWITTER.
Table 2: Accuracy and Macro-F1 scores of models using BERT-base (BT-b) and BERT-large (BT-l) under different settings on three benchmark datasets.
For the encoder, considering that high-quality text representations from pre-trained embeddings or language models are able to effectively enhance model performance (Mikolov et al., 2013; Song et al., 2018a,b; Song and Shi, 2018; Devlin et al., 2019; Diao et al., 2020) and that BERT-based models have achieved great success in many NLP tasks (Mao et al., 2019; Tang et al., 2020; Tian et al., 2020a,c, 2021b; Qin et al., 2021a,b), we use BERT-base-uncased and BERT-large-uncased (Devlin et al., 2019) to encode the encrypted input (i.e., $X_i$ and $A_i$) from $N_i$. For TM, we train our neural topic model using the unsupervised approach proposed by Miao et al. (2017) and then use the resulting topic-vocab matrix to initialize $W_\phi$ in TM-FL. In the training process of TM-FL, both BERT and $W_\phi$ are updated. Moreover, for the baselines (i.e., BT) on the single datasets and the union dataset, we choose the models based on their F1 scores on the dev set of each dataset separately. For FL and TM-FL, we choose the models according to the average of their three F1 scores over the dev sets of the three datasets. For the evaluation metrics, we follow previous studies (Tang et al., 2016a; He et al., 2018a) and evaluate all models via accuracy and macro-averaged F1 scores over all sentiment polarities, i.e., positive, neutral, and negative.
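The evaluation metrics and the model-selection rule described above can be sketched in plain Python. The function names are hypothetical, the example labels and scores are made up for illustration, and a label with no true or predicted instances is assumed to contribute an F1 of 0.

```python
def accuracy(gold, pred):
    """Fraction of aspect terms whose predicted polarity matches the gold one."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred, labels=("positive", "neutral", "negative")):
    """Unweighted mean of the per-polarity F1 scores."""
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(labels)


def select_checkpoint(dev_f1_by_checkpoint):
    """FL / TM-FL selection: highest F1 averaged over the dev sets of all nodes."""
    return max(dev_f1_by_checkpoint,
               key=lambda c: sum(dev_f1_by_checkpoint[c]) / len(dev_f1_by_checkpoint[c]))


# Toy usage with made-up predictions and made-up dev scores.
gold = ["positive", "positive", "neutral", "negative"]
pred = ["positive", "neutral", "neutral", "negative"]
acc = accuracy(gold, pred)   # 0.75
f1 = macro_f1(gold, pred)    # (2/3 + 2/3 + 1) / 3
best = select_checkpoint({"epoch_1": [0.70, 0.80, 0.60],
                          "epoch_2": [0.75, 0.75, 0.70]})  # "epoch_2"
```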

Overall Results
To evaluate TM-FL's performance, we compare it with 1) the baseline FL models without TM, i.e., FL (BT-b) and FL (BT-l); and 2) two BERT-only models without FL, for which the training instances are not isolated and are accessible to each other. Table 2 reports the accuracy and F1 scores of our TM-FL models and all the aforementioned baselines on the test sets of the three benchmark datasets. There are several observations. First, in most cases, models under the FL framework (ID: 3, 4, 7, 8) outperform the models trained on the single datasets (ID: 1, 5) with different encoders. This confirms that FL works well in leveraging extra isolated data with both BERT-base and BERT-large encoders. Second, the FL baselines (ID: 3, 7) fail to outperform the models trained on the union of all datasets (ID: 2, 6) with different encoders on all datasets, which demonstrates that even though FL can leverage extra isolated data, it still fails to reach the upper-bound performance provided by models (ID: 2, 6) that do not suffer from the data isolation problem. Third, our TM-FL models (ID: 4, 8) consistently outperform the FL baselines (ID: 3, 7) on all datasets. In addition, it is promising to observe that some results from TM-FL (e.g., ID: 4 on LAP14) are very close to those of the reference BERT-only models (ID: 2, 6), which provide potential upper bounds for FL-based models; this demonstrates the effectiveness of the proposed TM module in leveraging categorical information to facilitate localized prediction. Moreover, TM-FL shows larger improvements over FL on LAP14 and REST14 than on TWITTER, which can be explained by the fact that LAP14 and REST14 are product reviews focusing on a particular area, whereas TWITTER contains social media texts that may be heterogeneous. Such differences, both among domains and within the TWITTER domain, distract the model on TWITTER.

Comparison with Previous Studies
Since our experimental settings are different from the settings of most previous studies on the three benchmark datasets, direct comparisons of our results with previous studies are not valid. Compared with previous studies focusing on a single domain, FL can access extra data to help the model even though data from different datasets are not visible to each other. Previous studies that work on multiple datasets at the same time and leverage external knowledge do not conduct their experiments in an environment suffering from data isolation problems. To provide relatively fair comparisons with previous studies on a single dataset, we build another three simulated environments for FL and TM-FL where a single dataset, instead of the three datasets, is distributed across all the isolated nodes in each environment. This ensures that, for each dataset, external knowledge is not introduced into the model during the training process. Therefore, to a certain extent, it is relatively valid to compare our results with previous studies on every single dataset, where the comparisons are reported in Table 3. Notably, although TM-FL suffers from data isolation under the simulation setting, it still outperforms some studies (Mao et al., 2019; Xu et al., 2019b) using BERT-large (marked by "*") and achieves state-of-the-art results on LAP14, which further confirms the effectiveness of our approach in leveraging local isolated data. Besides, TM-FL fails to outperform some previous studies, including Tang et al. (2020), on REST14 and TWITTER, which could be explained by the fact that they leverage dependency information and use advanced architectures (e.g., GCN) to encode it.
Figure 4: A case study on two groups of sentences, where the aspect terms extracted from different nodes (i.e., LAP14, REST14, and TWITTER) are highlighted in red, and the predictions of TM-FL (BERT-large) and the FL baseline, as well as the gold labels, are presented below the corresponding sentence. The word list on the right side shows the top four topics (ranked by the weights they receive) in TM.

The effect of Topic Model
To further explore the effect of the topic model, we test FL and our proposed TM-FL on the test sets with randomly initialized $W_\phi$ and with pre-trained $W_\phi$ obtained from the topic model, and report the results in Table 4. First, TM-FL with either pre-trained or randomly initialized $W_\phi$ outperforms FL, which is reasonable because TM is able to leverage the domain information from extra labeled data and hence helps ABSA with localized sentiment polarity prediction. Moreover, TM-FL with pre-trained $W_\phi$ (T.) outperforms TM-FL with randomly initialized $W_\phi$ (R.), demonstrating the effectiveness of the topic model in leveraging external, domain-specific topic knowledge from other datasets to help the centralized model on ABSA in the simulated environment.

Case Study
To examine whether our approach with TM is able to capture categorical information to facilitate localized prediction, we conduct a case study with two sentence groups (i.e., the first group with sentences (1) and (2), and the second group with sentences (3) and (4)), where the sentences are obtained from different domains (i.e., the test sets of the LAP14, REST14, and TWITTER datasets). Figure 4 illustrates the two sentence groups (the aspect term is highlighted in red in each sentence), where the predictions from the FL baseline (with BERT-large) and our TM-FL, as well as the gold labels, are also presented. Besides, the top four topic words (ranked by the weights received in TM) for each individual sentence are presented on the right side. It is worth noting that in each group, both sentences share the same opinion word (i.e., "hot" and "fresh", respectively, highlighted in yellow), which conveys contradictory sentiment polarities. Specifically, in the first sentence group, the shared opinion word is "hot", which generally indicates negative sentiment polarity in laptop reviews but positive sentiment polarity in restaurant reviews. In LAP14, 75% of the instances containing "hot" are associated with the negative sentiment polarity, whereas in REST14, no more than 1/3 of such instances are. Compared with the FL baseline, our approach enhanced by TM successfully leverages the categorical information and hence is able to distinguish the cues from "hot" in a particular context, which results in correct predictions for both instances, whereas FL fails to recognize that "hot" suggests a positive sentiment polarity in sentence (2) from restaurant reviews and thus makes an incorrect prediction.
Moreover, in the second sentence group, the word "fresh" serves as the shared opinion word, with its sentiment polarity being generally positive in the domain of restaurant reviews and neutral in the domain of tweets. FL successfully models the opinion word "fresh" and predicts the sentiment polarity for the aspect term "pizza" in sentence (3), while it fails to distinguish the domain difference between sentence (3) and sentence (4). Therefore, due to the cue from "fresh" in the restaurant domain, FL incorrectly models the opinion word "fresh" in the other domain and hence makes an incorrect sentiment polarity prediction for the aspect term "britney spears". However, our approach is able to distinguish the domain information in sentence (3) and sentence (4), resulting in correct predictions for both instances.

Conclusion
In this paper, we present TM-FL, a domain-aware topic memory network under the federated learning framework that enhances ABSA under the restriction of data isolation. Specifically, our approach offers an alternative that enhances ABSA by leveraging extra labeled data through the FL framework improved by TM. Experimental results on three widely used English benchmark datasets demonstrate the effectiveness of our method, which outperforms all the baseline models trained under the federated learning framework and is competitive with state-of-the-art performance on all datasets.

Appendix A. Hyper-parameter Settings
Table 5 reports the hyper-parameters tested in training our models. We test all combinations of them for each model and use the one achieving the highest accuracy score in our final experiments.
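Testing all combinations amounts to an exhaustive grid search, which can be sketched as follows using the values listed in Table 5. The `evaluate` callback is hypothetical: it stands in for training a model with the given configuration and returning its accuracy, and the toy scoring function at the end is made up purely to show the call.

```python
from itertools import product

# Candidate values as listed in Table 5.
GRID = {
    "learning_rate": [5e-6, 1e-5, 2e-5, 3e-5],
    "warmup_rate": [0.06, 0.1],
    "dropout_rate": [0.1],
    "batch_size": [8, 16, 32],
}


def all_configs(grid):
    """Every combination of the candidate values, as a list of dicts."""
    keys = list(grid)
    return [dict(zip(keys, values))
            for values in product(*(grid[k] for k in keys))]


def best_config(grid, evaluate):
    """Return the configuration with the highest score from `evaluate`."""
    return max(all_configs(grid), key=evaluate)


configs = all_configs(GRID)  # 4 * 2 * 1 * 3 = 24 configurations
# Made-up scoring function, used only to demonstrate the interface.
toy_best = best_config(GRID, lambda c: -abs(c["learning_rate"] - 2e-5) - c["batch_size"] * 1e-9)
```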

Hyper-parameters    Values
Learning Rate       5e-6, 1e-5, 2e-5, 3e-5
Warmup Rate         0.06, 0.1
Dropout Rate        0.1
Batch Size          8, 16, 32
Table 5: The hyper-parameters tested in tuning our models, where the best ones used in our final experiments are highlighted in boldface.

Appendix B. Model Size and Performance
Table 6 reports the number of trainable parameters and the inference speed (sentences per second) of the baselines (i.e., BERT (single), BERT (union), and FL with BERT-base and BERT-large) and our models (i.e., TM-FL with BERT-base and BERT-large) on all three datasets. All models are run on an NVIDIA Tesla V100 GPU.
Table 6: Numbers of trainable parameters (Para.) in different models and the inference speed (sentences per second) of these models on the test sets of both datasets. "BT-b" and "BT-l" refer to encoder BERT-base and BERT-large respectively.

Appendix C. Experimental Results on the Development Set
Table 7 reports the F1 scores of different models on the development sets of LAP14 and REST14.

Appendix D. Mean and Deviation of the Results
In the experiments, we test models with different configurations. For each model, we train it with the best hyper-parameter setting using five different random seeds. We report the mean (µ) and standard deviation (σ) of the F1 scores on the test sets of LAP14, REST14 and TWITTER in Table 8.
Table 8: The mean (µ) and standard deviation (σ) of F1 scores of our TM-FL model and baselines on the test set of LAP14, REST14 and TWITTER for aspect-based sentiment analysis. "BT-b" and "BT-l" refer to encoder BERT-base and BERT-large respectively.
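Aggregating the runs over five random seeds can be sketched as below. The text does not state whether the sample or the population standard deviation is reported, so the use of the sample standard deviation here is an assumption; the scores in the example are placeholders, not results from the paper.

```python
from statistics import mean, stdev


def summarize(f1_scores):
    """Mean and sample standard deviation (n - 1 denominator) over seeds."""
    return mean(f1_scores), stdev(f1_scores)


# Placeholder F1 scores for five seeds (not taken from the paper).
mu, sigma = summarize([1.0, 2.0, 3.0, 4.0, 5.0])  # mu = 3.0, sigma = sqrt(2.5)
```

If the population standard deviation is intended instead, `statistics.pstdev` can be substituted for `stdev`.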