Federated Chinese Word Segmentation with Global Character Associations

Chinese word segmentation (CWS) is a fundamental task for Chinese information processing, which often suffers from out-of-vocabulary (OOV) word issues, especially when a model is tested on data from different sources. Although one possible solution is to use more training data, in real applications such data are stored at different locations and thus are isolated from one another owing to privacy or legal issues (e.g., clinical reports from different hospitals). To address this issue and benefit from extra data, we propose a neural model for CWS with federated learning (FL) adopted to deal with data isolation, where a mechanism of global character associations is proposed to enhance FL in learning from different data sources. Experimental results in a simulated environment with five nodes confirm the effectiveness of our approach, which outperforms different baselines including some well-designed FL frameworks. The code and models involved in this paper are released at https://github.com/cuhksz-nlp/GCASeg.


Introduction
Chinese word segmentation (CWS) is a preliminary and vital task for natural language processing (NLP). It aims to segment a Chinese character sequence into words and is thus generally performed as a sequence labeling task (Tseng et al., 2005; Levow, 2006; Song et al., 2009a; Sun and Xu, 2011; Xia, 2012, 2013; Mansur et al., 2013). Although recent neural CWS systems (Pei et al., 2014; Chen et al., 2017; Ma et al., 2018; Higashiyama et al., 2019; Qiu et al., 2019; Ke et al., 2020; Huang et al., 2020a; Tian et al., 2020e) have achieved very good performance on benchmark datasets, it remains an unsolved task (Fu et al., 2020), because it is challenging to handle out-of-vocabulary (OOV) words, especially in real applications where the test data may come from different sources. Although leveraging extra labeled data from other sources or domains could alleviate this issue, in real applications such data are often located at different nodes and thus are inaccessible to each other because of privacy or legal concerns (e.g., clinical or financial reports from different hospitals or companies).
To address the data isolation issue, federated learning (FL) (Shokri and Shmatikov, 2015; Konečnỳ et al., 2016) has been proposed and has shown great promise for many machine learning tasks (Aono et al., 2017; Sheller et al., 2018; He et al., 2020). In many cases, data at different nodes are encrypted and aggregated into a centralized model, and remain invisible to one another during the training stage. This property makes FL an essential technique for real applications with privacy and security requirements. However, conventional FL techniques are more suitable for nodes sharing homogeneous data, which is seldom the case for NLP tasks. Particularly for CWS, the appropriate segmentation is sensitive to the data source, since the text and vocabularies used in different datasets exhibit various expression patterns. For example, in real applications such as Input Method Editors (IME, e.g., a pinyin input environment), there are millions of individual users whose data are stored at isolated nodes, and different nodes could have diverse segmentation requirements due to user preferences. Therefore, the restricted data access of traditional FL approaches could result in inferior performance for CWS, since they cannot update the model to facilitate localized prediction. Unfortunately, limited attention has been paid to addressing this issue. Most existing approaches (Liu et al., 2019; Huang et al., 2020b; Sui et al., 2020) with FL on NLP (e.g., for language modeling (Hard et al., 2018; Chen et al., 2019), named entity recognition (Ge et al., 2020), and text classification (Zhu et al., 2020)) mainly focus on optimizing the learning process and ignore domain diversities.

Figure 1: The server-node architecture of our approach. The encrypted information (i.e., encrypted data, word segmentation tags, and loss) is communicated between a node and the server, where the locally stored data is inaccessible to other nodes during the training process.
In this paper, we propose an FL-based neural model (GCA-FL) for CWS, which is enhanced by a global character association (GCA) mechanism in a distributed environment. The GCA mechanism is designed to capture contextual information (patterns) in a particular input for localized prediction and to handle the difficulty of identifying text sources caused by data inaccessibility. Specifically, GCA serves as a server-side component that associates global character n-grams with the inputs from each node and responds with contextual information to help the backbone segmenter. Experimental results in a simulated environment with isolated data from five domains demonstrate the effectiveness of our approach, where GCA-FL outperforms different baselines including ones with a well-designed FL framework.

Figure 1 illustrates the overall server-node architecture for applying our approach. The centralized model is stored on the FL server and data from multiple sources (domains) are stored at different nodes (the $i$-th node is denoted by $N_i$). Encrypted information (e.g., data, vectors, and loss) is exchanged between each node $N_i$ and the FL server. In this way, the original data stay at the local node and are not accessible to the other nodes. To encode contextual information (patterns) that facilitates localized prediction, we enhance FL by introducing GCA into the centralized model (Figure 2), which follows the character-based sequence labeling paradigm for CWS. Herein, GCA encodes the contextual information from the encrypted input and uses the resulting information to guide the centralized model to make localized predictions. In the following, we introduce FL for CWS and then the centralized model with GCA.

Federated Learning
In the training process of FL, node $N_i$ first encrypts the original input character sequence into $\mathcal{X}_i = x_{i,1} \cdots x_{i,j} \cdots x_{i,l}$, where $x_{i,j}$ denotes the $j$-th (encrypted) character in $\mathcal{X}_i$. Next, $N_i$ passes $\mathcal{X}_i$ to the server. Then, the centralized model on the server processes $\mathcal{X}_i$ and predicts the corresponding label sequence $\mathcal{Y}_i = y_{i,1} \cdots y_{i,j} \cdots y_{i,l}$, where $y_{i,j} \in \mathcal{T}$ ($\mathcal{T}$ is the label set) is the segmentation label for $x_{i,j}$. Afterwards, $\mathcal{Y}_i$ is passed back to $N_i$ and compared with the gold label sequence $\mathcal{Y}^*_i$, after which the loss $\mathcal{L}_i$ for that training instance is computed locally. Finally, $\mathcal{L}_i$ is passed to the server and the parameters of the centralized model are updated accordingly.
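To make this round-trip concrete, the following is a minimal sketch of one training step; the encryption, prediction, and update routines are hypothetical stand-ins (the actual centralized model is the neural segmenter with a CRF described in the next subsection), so the sketch only illustrates the communication pattern.

```python
# A minimal sketch of one FL training step, assuming placeholder routines
# for encryption, prediction, and the server-side update.
from typing import List

def encrypt(chars: List[str]) -> List[str]:
    # Placeholder: map each raw character to a token the server can process
    # but other nodes cannot read back.
    return ["enc(" + c + ")" for c in chars]

class CentralizedModel:
    """Server-side segmenter: encrypted characters -> segmentation labels."""
    def predict(self, enc_chars: List[str]) -> List[str]:
        # Placeholder prediction; the real model is an encoder + GCA + CRF.
        return ["S"] * len(enc_chars)

    def update(self, loss: float) -> None:
        # Placeholder parameter update driven by the node-reported loss.
        pass

def train_step(chars: List[str], gold: List[str], server: CentralizedModel) -> float:
    enc = encrypt(chars)           # node N_i: encrypt X_i locally
    pred = server.predict(enc)     # server: predict Y_i and send it back
    # node N_i: compare Y_i with the gold labels Y*_i and compute the loss
    # (a simple mismatch rate here; the real loss is a CRF objective)
    loss = sum(p != g for p, g in zip(pred, gold)) / len(gold)
    server.update(loss)            # server: update the centralized model
    return loss

# Usage with one toy instance from a node.
print(train_step(list("市长江大桥"), ["B", "E", "B", "I", "E"], CentralizedModel()))
```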

Centralized Model with GCA
In the standard FL framework, the backbone centralized model follows the encoding-decoding paradigm, where $\mathcal{X}_i$ is encoded into a sequence of hidden vectors ($h_{i,j}$ denotes the hidden vector for $x_{i,j}$), which are then sent to a decoder (e.g., CRF) to obtain the prediction $\mathcal{Y}_i$. However, data from different nodes (sources) often contain heterogeneous vocabularies and expression patterns, so standard FL may obtain inferior results for localized prediction because it cannot distinguish the contextual information from the isolated data. Therefore, motivated by previous studies that leverage n-grams to capture local contextual information (Song et al., 2009b; Pei et al., 2014; Chen et al., 2017; Higashiyama et al., 2019), we propose GCA to enhance standard FL by exploring the contextual information carried by n-grams in the running text and using it to guide the centralized model in making localized predictions. Specifically, GCA contains three components, namely, a lexicon $\mathcal{D}$ of global character n-grams, an n-gram embedding matrix that maps an n-gram in $\mathcal{D}$ to its embedding, and a position embedding matrix that maps a position pattern, i.e., the position (e.g., beginning, ending, and inside) of a character in an n-gram, to its embedding. For each character $x_{i,j}$, GCA encodes the contextual information and uses it to enhance the centralized model as follows. First, GCA extracts all $m$ n-grams $s_{i,j,k}$ ($1 \le k \le m$) associated with $x_{i,j}$ from $\mathcal{D}$, where $s_{i,j,k}$ must contain $x_{i,j}$ and be a sub-string of $\mathcal{X}_i$. Second, according to the position of $x_{i,j}$ in $s_{i,j,k}$, GCA finds the position pattern $v_{i,j,k}$ associated with $s_{i,j,k}$ and $x_{i,j}$ based on the rules specified in Table 1. For example, if $x_{i,j}$ = "市" (city) and $s_{i,j,k}$ = "市长" (mayor), the position pattern $v_{i,j,k}$ will be "$V_B$" according to the rules in Table 1, because $x_{i,j}$ is at the beginning of $s_{i,j,k}$. Third, GCA applies the n-gram embedding matrix and the position embedding matrix to $s_{i,j,k}$ and $v_{i,j,k}$, respectively, and obtains the n-gram embedding $e^s_{i,j,k}$ and the position embedding $e^v_{i,j,k}$. Then, GCA computes the weight $p_{i,j,k}$ for each position pattern $v_{i,j,k}$ by

$$p_{i,j,k} = \frac{\exp(h_{i,j} \cdot \mathbf{W} e^s_{i,j,k})}{\sum_{k'=1}^{m} \exp(h_{i,j} \cdot \mathbf{W} e^s_{i,j,k'})}$$

where $\mathbf{W}$ is a trainable matrix that maps $e^s_{i,j,k}$ to the same dimension as $h_{i,j}$ to facilitate the inner product "$\cdot$". GCA further applies $p_{i,j,k}$ to all position embeddings and obtains the representation of contextual information $u_{i,j}$ for $x_{i,j}$ by

$$u_{i,j} = \sum_{k=1}^{m} p_{i,j,k} \, e^v_{i,j,k}$$

Afterwards, $u_{i,j}$ is added to $h_{i,j}$ to guide the backbone model for localized prediction, where the resulting vector is mapped into the output space by a trainable matrix $\mathbf{W}_o$ and bias $b_o$:

$$o_{i,j} = \mathbf{W}_o (h_{i,j} + u_{i,j}) + b_o$$

Finally, $o_{i,j}$ is fed into a CRF decoder to obtain the predicted segmentation label $y_{i,j}$ for $x_{i,j}$.
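For concreteness, below is a minimal PyTorch sketch of this computation for a single character position; the tensor shapes and the softmax form of the weights follow the description above but are our assumptions rather than the exact released code.

```python
# A minimal sketch of GCA for one character position x_{i,j}.
import torch
import torch.nn as nn

class GCA(nn.Module):
    def __init__(self, ngram_dim: int, hidden_dim: int):
        super().__init__()
        # W maps n-gram embeddings into the hidden space of h_{i,j}
        self.W = nn.Linear(ngram_dim, hidden_dim, bias=False)

    def forward(self, h: torch.Tensor, e_s: torch.Tensor, e_v: torch.Tensor) -> torch.Tensor:
        """h: (hidden_dim,)   hidden vector of x_{i,j}
        e_s: (m, ngram_dim)   embeddings of the m matched n-grams s_{i,j,k}
        e_v: (m, hidden_dim)  embeddings of their position patterns v_{i,j,k}
        returns u_{i,j}: (hidden_dim,)"""
        scores = self.W(e_s) @ h               # inner product h . (W e^s_k), shape (m,)
        p = torch.softmax(scores, dim=0)       # weights p_{i,j,k}
        return (p.unsqueeze(-1) * e_v).sum(0)  # u_{i,j} = sum_k p_{i,j,k} e^v_{i,j,k}

# The contextual vector is added to h and projected before the CRF decoder.
hidden_dim, ngram_dim, m = 768, 200, 4
gca = GCA(ngram_dim, hidden_dim)
W_o = nn.Linear(hidden_dim, 4)                 # 4 = |T|, e.g., the BIES label set
h = torch.randn(hidden_dim)
u = gca(h, torch.randn(m, ngram_dim), torch.randn(m, hidden_dim))
o = W_o(h + u)                                 # o_{i,j} = W_o (h_{i,j} + u_{i,j}) + b_o
```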

Simulations
To test the proposed approach, we follow the convention of recent FL-based NLP studies (Liu et al., 2019; Huang et al., 2020b; Zhu et al., 2020; Sui et al., 2020) and build a simulated environment where isolated data are stored at five nodes. Each node contains one of the five genres (i.e., broadcast conversation (BC), broadcast news (BN), magazine (MZ), newswire (NW), and weblog (WEB)) in CTB7 (LDC2010T07) for CWS. Therefore, data from different genres are distributed to the five nodes without overlapping (i.e., the data sources in our simulation are heterogeneous), which is similar to the simulation setting of the aforementioned studies. We split each genre into train/dev/test sets following Wang et al. (2011) and report the statistics (in terms of the number of sentences, word tokens, and OOV rate) in Table 2.

Baselines and Reference Models
To show the effectiveness of our approach with GCA, we compare it with a baseline model that follows the FL framework without using GCA. In addition, we also run two reference models that use neither FL nor GCA, where all training instances are not isolated and are accessible to each other. Specifically, the first reference model (denoted by Single) is trained and evaluated on the data from a single node (genre). The second (denoted by Union) is trained on the union of training instances from all five nodes (genres) and evaluated on a single node. Herein, the Union reference model can be optimized on a particular local node to achieve the best localized prediction; on the contrary, a model under the FL setting is stored on the server and shared by all nodes, so that optimizing it on a particular node could significantly hurt the performance on others. Therefore, the setting of the Union reference model is an ideal situation that rarely holds in real applications, and it thus provides a potential upper bound on model performance for FL-based approaches.

Implementation
A good text representation is generally a prerequisite for outstanding model performance (Pennington et al., 2014; Peters et al., 2018). To obtain high-quality text representations, in our experiments we try two types of encoders in the centralized model, i.e., the Chinese version of BERT (Devlin et al., 2019) and the large version of ZEN 2.0, because these pre-trained language models have been demonstrated to be effective in many NLP tasks (Nie et al., 2020; Huang et al., 2020a; Fu et al., 2020; Tian et al., 2020a,b,c,d, 2021a; Qin et al., 2021). For both BERT and ZEN 2.0, we use the default settings (i.e., 12 layers of multi-head attention with 768-dimensional hidden vectors for BERT and 24 layers of multi-head attention with 1024-dimensional hidden vectors for ZEN 2.0). We use the vocabulary in Tencent Embedding to initialize our lexicon $\mathcal{D}$ and the n-gram embedding matrix, where n-grams longer than five characters are filtered out (a sketch of this initialization is given at the end of this section). During the training stage, we fix the n-gram embedding matrix and update all other parameters (including BERT). For evaluation, we follow previous studies (Chen et al., 2017; Ma et al., 2018; Qiu et al., 2019) and use the F1 score. Other hyper-parameter settings are reported in Table 3. We test all their combinations for each model on the development set, and the model achieving the highest F1 score on the development set is evaluated on the test set (the best hyper-parameter setting in our experiments is highlighted in boldface).

Table 4 illustrates the experimental results (i.e., F1 scores) of our GCA-FL models and all the aforementioned baselines (i.e., FL) and reference models (i.e., Single and Union) with BERT (a) and ZEN 2.0 (b) encoders on the test sets of BC, BN, MZ, NW, and WEB from CTB7. There are several observations. First, models under the FL framework (i.e., FL and GCA-FL) outperform the reference model (Single) trained on a single node with both the BERT and ZEN 2.0 encoders, which confirms that FL works well to leverage extra isolated data. Second, our GCA-FL model consistently outperforms the FL baseline on all nodes (genres), although the FL baseline with BERT and ZEN 2.0 already achieves very good performance. This observation demonstrates the effectiveness of the proposed GCA mechanism in leveraging contextual information to facilitate localized prediction. Third, GCA-FL achieves competitive results compared with the Union reference model in most cases. This observation is promising because the Union model has all training data available without suffering from the data isolation problem, and thus provides a potential upper bound for FL-based models. The results obtained by GCA-FL thus further confirm the effectiveness of GCA.
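As a concrete illustration of the lexicon initialization mentioned above, the sketch below builds $\mathcal{D}$ and the frozen n-gram embedding matrix from a pre-trained embedding file; the whitespace-separated file format (one token followed by its vector per line) and the helper name are assumptions for illustration, not the exact preprocessing pipeline.

```python
# A minimal sketch of lexicon initialization from a pre-trained embedding file,
# assuming each line holds a token followed by its embedding values.
import numpy as np

def load_lexicon(path: str, max_len: int = 5):
    lexicon, vectors = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            token, values = parts[0], parts[1:]
            if 1 <= len(token) <= max_len:      # drop n-grams longer than five characters
                lexicon.append(token)
                vectors.append([float(x) for x in values])
    # The returned matrix is kept fixed during training; only other parameters are updated.
    return lexicon, np.asarray(vectors, dtype=np.float32)
```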

Effect of GCA
To analyze the effect of GCA in leveraging isolated extra data to facilitate localized prediction, especially for OOV words, we illustrate the OOV recall of different models (i.e., Single, Union, FL, and GCA-FL) on the five nodes (genres) with the BERT encoder in Figure 3. Similar to the main experimental results, it is observed that FL and GCA-FL outperform the Single model in identifying unseen words (OOV). Further, GCA-FL outperforms the FL baseline on the test data of all nodes, where the highest improvement is observed on the node storing newswire data. One possible explanation is that BC and BN contain texts similar to NW; GCA-FL can better learn from the similar data on these nodes and thus improve localized prediction, especially for OOV words.
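For clarity, the OOV recall reported here can be computed as in the following sketch, which counts, among gold words unseen in a node's training data, the fraction whose exact spans are recovered by the model; the span-matching convention is our assumption.

```python
# A minimal sketch of OOV recall over segmented sentences (lists of word strings).
def to_spans(words):
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w), w))   # (start, end, word) character span
        start += len(w)
    return spans

def oov_recall(gold_sents, pred_sents, train_vocab):
    hit = total = 0
    for gold, pred in zip(gold_sents, pred_sents):
        pred_spans = to_spans(pred)
        for span in to_spans(gold):
            if span[2] not in train_vocab:      # gold word unseen in training data
                total += 1
                hit += span in pred_spans       # exact span recovered by the model
    return hit / total if total else 0.0

# Usage with one toy sentence (hypothetical data).
print(oov_recall([["南京", "市长", "江大桥"]],
                 [["南京", "市", "长江大桥"]],
                 {"南京", "市长"}))
```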

Conclusion
In this paper, we apply FL to CWS to leverage isolated data stored at different nodes and propose GCA to enhance the CWS model stored on the server. Specifically, our approach encodes contextual information by associating the input characters with global character n-grams, and uses that information to guide the backbone model to make localized predictions. Experimental results in a simulated environment with five isolated nodes built from CTB7 demonstrate the effectiveness of the proposed approach: it outperforms the baseline model trained under the FL framework and achieves competitive results compared with the reference model trained on the union of the data from all nodes. Further analyses of OOV identification confirm the validity of the GCA mechanism in leveraging data on other nodes to facilitate localized prediction and demonstrate its potential for real-world applications.