Chinese Opinion Role Labeling with Corpus Translation: A Pivot Study

Opinion Role Labeling (ORL), which aims to identify the key roles of an opinion, has received increasing interest. Unlike most previous work, which focuses on English, in this paper we present the first work on Chinese ORL. We construct a Chinese dataset by manually translating the standard English MPQA dataset and projecting its annotations. We then investigate the effectiveness of cross-lingual transfer methods, including model transfer and corpus translation. We exploit multilingual BERT with the Contextual Parameter Generator and Adapter methods to examine the potential of unsupervised cross-lingual learning. Our experiments and analyses of both bilingual and multilingual transfer establish a foundation for future research on this task.


Introduction
Fine-grained opinion mining has long been a crucial task in natural language processing (NLP) (Kim and Hovy, 2006a; Breck et al., 2007; Wilson et al., 2009; Qiu et al., 2011; Irsoy and Cardie, 2014; Liu et al., 2015; Wiegand et al., 2016). It aims to discover useful structural information about user opinions in unstructured text, namely the relation between an expression and its entities, such as Who expressed what kind of sentiment towards what? The EXPRESSION conveys attitudes including sentiments, agreements, beliefs, or intentions (e.g., voiced his condolences in Figure 1); the entities consist of the HOLDER who expresses the opinion (e.g., Chen) and the TARGET toward which the opinion is expressed (e.g., the families) (Breck et al., 2007; Yang and Cardie, 2012; Katiyar and Cardie, 2016a). Here we focus on the opinion role labeling (ORL) task, which is to identify opinion holders and targets (Marasović and Frank, 2017; Zhang et al., 2019a) when the expressions are given.
Most previous research focuses on English ORL, benefiting from the benchmark MPQA dataset (Wiebe et al., 2005), which includes span-based annotations of opinion expressions, holders and targets. The task is commonly solved by sequence labeling models with the BIO conversion scheme (Kim and Hovy, 2006b; Choi et al., 2006; Yang and Cardie, 2013; Johansson and Moschitti, 2013). Recently, neural BiLSTM-CRF models have achieved state-of-the-art performance on this task (Katiyar and Cardie, 2016b; Marasović and Frank, 2017; Zhang et al., 2019a). However, studies on other languages are relatively rare due to the scarcity of annotated datasets. To the best of our knowledge, there is only one exception, by Almeida et al. (2015a), who annotated a small-scale dataset for Portuguese.
Unsupervised cross-lingual transfer (Xu et al., 2018) is one promising way to address the low-resource problem for ORL. Under the neural setting, there are two representative categories of methods: model transfer (McDonald et al., 2013; Swayamdipta et al., 2016; Daza and Frank, 2019) and corpus translation (Zhang et al., 2019b). Model transfer trains a model on a resource-rich language using only language-independent features such as multilingual BERT (Devlin et al., 2018; Pires et al., 2019) and then applies it to the target language. The corpus translation approach first obtains parallel corpora through either human or machine translation and then projects the annotations from the source language to the target side.
In this work, we present the first study of Chinese ORL. First, we construct a benchmark corpus by manually translating the English MPQA corpus, which involves auto-translation (i.e., automatic sentence translation and opinion alignment) and human refinement. Second, we investigate the performance of unsupervised cross-lingual transfer for Chinese ORL based on the annotated corpus. We study Contextual Parameter Generator Networks (PGN) in multilingual BERT with the Adapter method, known for its parameter efficiency (Üstün et al., 2020), which we call PGN-Adapter, and discover the complementarity of the model transfer and corpus translation methods.
We conduct experiments on the newly constructed Chinese dataset to evaluate our methods, together with the English MPQA corpus (Wiebe et al., 2005) and the Portuguese dataset (Almeida et al., 2015a) for cross-lingual transfer. We observe that for the unsupervised cross-lingual transfer from the English corpus, the translation-based method is better than the model transfer, and their combination leads to further improvements. Although the scale of the Portuguese corpus is much smaller, adding it into the multilingual transfer still outperforms the bilingual counterpart.
To summarize, in this paper, we have the following contributions: • We manually translate and annotate a Chinese fine-grained ORL corpus for research purposes, especially for the cross-lingual ORL study.
• We conduct cross-lingual ORL (to Chinese) through unsupervised model transfer and corpus translation with PGN-Adapter, setting up strong baselines for future research.
• We perform extensive experiments and analyses to demonstrate the pros and cons of the different approaches.

Related Work
Fine-Grained Opinion Mining There have been a number of studies in fine-grained opinion mining (Wilson et al., 2009; Qiu et al., 2011; Wiegand et al., 2016). Kim and Hovy (2006a) exploit a semantic role labeller to extract opinion holders and topics. Choi et al. (2005) and Breck et al. (2007) study the extraction of opinion expressions and their sources. Irsoy and Cardie (2014) and Liu et al. (2015) use recurrent neural networks for opinion mining. Johansson and Moschitti (2013) and Katiyar and Cardie (2016a) propose joint models for opinion expressions, holders and targets.
Opinion Role Labeling As for ORL, Marasović and Frank (2018) exploit multi-task learning to use SRL information to improve ORL scores. Zhang et al. (2019a) utilize semantic role labeling to enhance ORL, comparing three different integration approaches. Bo et al. (2020) propose dependency-based graph convolutional networks to enhance ORL with syntax information. All these studies focus on English ORL using supervised models, assuming that a training corpus is already available. In this work, we investigate Chinese ORL, manually building a benchmark dataset for Chinese and then studying unsupervised cross-lingual transfer for the task.
Cross-Lingual Transfer Learning Cross-lingual transfer learning has been extensively applied in NLP, including sentiment classification (Zhou et al., 2016), POS tagging (Wisniewski et al., 2014; Kim et al., 2017), named entity recognition (Zirikly and Hagiwara, 2015), semantic role labeling (Fei et al., 2020), and dependency parsing (McDonald et al., 2011; Tiedemann et al., 2014; Guo et al., 2016; Zhang et al., 2019b). Unsupervised cross-lingual transfer has received great interest (Duong et al., 2015; Xu et al., 2018), and is our major focus. The work of Zhang et al. (2019b) is most closely related to our study, applying model transfer and corpus translation to dependency parsing. Our work focuses on ORL, applying the two approaches to Chinese. Our findings also differ from earlier translation-based results (2018), for three reasons: 1) Google's MT system is much better in 2020 than in 2018; lower translation quality causes more problems for argument mining, while a high-performance MT system enhances the translation-based approach. 2) Our projection strategy is also different (Section 5.1): we project only non-crossing labels, in order to ensure mapping quality. 3) In addition, more advanced methods like PGN, Adapter and BERT also play a significant role in cross-lingual tasks.

The Construction of Chinese Dataset
We manually construct a Chinese ORL dataset to facilitate our research. To reduce the overall cost, we exploit corpus translation to assist the construction process, converting the English MPQA corpus (Wiebe et al., 2005; Wilson, 2008) into Chinese. The conversion contains the following four steps, in order: (1) sentence translation, (2) manual revision, (3) opinion projection, and (4) manual correction. The first and third steps constitute automatic corpus translation, which has been used as one approach for unsupervised cross-lingual transfer, while the second and fourth steps ensure the final quality. The whole construction is conducted at the sentence level.
Sentence Translation Neural machine translation (NMT) has achieved state-of-the-art performance for a range of language pairs (Vaswani et al., 2017). In particular, state-of-the-art NMT can reach a BLEU score over 45 (Li et al., 2019), so it is practical to use NMT for automatic sentence translation. Here we first translate all the English sentences of the MPQA dataset into Chinese automatically using the Google translator 2.
Manual Revision Next, we have several native speakers check the translation quality and revise imperfect translations. There are two types of revisions. On the one hand, the translated sentences may contain errors, and human intervention is required to correct them. On the other hand, the automatically translated sentences may not match the style of native speakers, and we let our annotators rewrite these sentences. Table 1 shows an example of each condition.
Opinion Projection Third, we project all opinions (expressions, holders and targets) from each English sentence onto its Chinese translation. Before the projection, we use the Stanford Segmenter for word segmentation 3. The projection is supported by automatic word alignments, which can be produced by a word-alignment tool.
Here we exploit the fast-align tool 4 (Dyer et al., 2013) to calculate the alignment probabilities. Figure 2 shows an example illustrating the projection process. Concretely, given an English-Chinese sentence pair (e_1 ... e_n, c_1 ... c_m) and its English-to-Chinese alignment probabilities a(c_j | e_i), the projection is performed as follows: (1) We incrementally obtain the text spans in the Chinese sentence for the opinion expressions as well as their holders and targets in the English sentence.
(2) For each word e_i in the English sentence, we find its corresponding word c_{p_i} in the Chinese sentence by p_i = argmax_j a(c_j | e_i), resulting in a set of one-to-one word pairs: M = {(e_1, c_{p_1}), ..., (e_n, c_{p_n})}.
(3) For each span e_{i,j} (i.e., expression, holder or target) in the English sentence, we find its corresponding span c_{i',j'} in the Chinese sentence by maximizing the covered word-pair set M with the least span length.
(4) We remove a projected span when (j' − i') ≥ 2 · (j − i + 1), regarding it as low-quality. If an expression is removed, its holder and target are removed as well.
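The projection steps above can be sketched as a short routine. This is a minimal sketch under our own assumptions: `align` stands in for the fast-align probabilities a(c_j | e_i) as a dense matrix, and the function name is ours, not from an actual released implementation.

```python
# Sketch of the opinion projection heuristic (steps 2-4 above).
# align[i][j] plays the role of the fast-align probability a(c_j | e_i).

def project_span(span, align):
    """Project an inclusive English span (i, j) onto the Chinese side.

    Returns the smallest Chinese span covering the aligned words,
    or None if the projection is judged low-quality (step 4).
    """
    i, j = span
    # Step 2: map each English word to its most probable Chinese word.
    mapped = [max(range(len(align[k])), key=lambda c: align[k][c])
              for k in range(i, j + 1)]
    # Step 3: the minimal covering span on the Chinese side.
    ci, cj = min(mapped), max(mapped)
    # Step 4: discard projections that grow too much relative to the source.
    if (cj - ci) >= 2 * (j - i + 1):
        return None
    return ci, cj
```

In practice the removal of a low-quality expression span would also trigger removal of its holder and target spans, as described above.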
Manual Correction The last step is another manual check to ensure the quality of the automatic opinion projection. There can be several types of errors, including word boundary errors and misalignments. As for continuity and fluency in the sentence, we make some trade-offs, as shown in Table 2. Through the above four steps, we obtain a benchmark dataset for Chinese ORL; a comparison of the Chinese and English arguments can be seen in Table 3. Notably, when we apply the corpus translation approach for unsupervised cross-lingual transfer, only the first and third steps are required, and all human intervention is removed for full automation. For corpus translation of other language pairs, one just needs to replace English by the desired source language and Chinese by the desired target language.
For the manual revision of the translated sentences and the correction of the final ORL annotations, we recruit three volunteers, all of whom are native Chinese speakers fluent in English. The translation part starts from the machine translation output, to which only minor corrections are needed. Two students first revise it independently, then proofread together and select the better of their two versions; in case of conflict, the third volunteer makes suggestions and decides the final version. Since a revision is accepted as long as the sentence's meaning is correct and conforms to Chinese usage, we did not compute a kappa score for translation. The kappa scores for word alignment are much higher, so we omit them from the paper. Since the corpus is not built from scratch but constructed by translation, the selected sentences are directly sourced from the English MPQA corpus, and the manual efforts of translation and alignment are highly straightforward with little ambiguity. The word alignment part is handled similarly to the translation part.

Model
Opinion role labeling aims to discover the opinion arguments given opinion expressions. The task can be modeled as a sequence labeling problem (Zhang et al., 2019a). We adopt the BMESO scheme to convert spans of opinion arguments into a sequence of word-level boundary tags, where B, M and E denote the beginning, middle and ending words of an argument, respectively, S denotes a single-word argument, and O denotes the remaining words. Formally, assuming the input sentence is sent = w_1, ..., w_n and a given opinion expression is expr = w_b, ..., w_e (1 ≤ b ≤ e ≤ n), our task is to assign a sequence of boundary tags t_1, ..., t_n. We exploit a BiLSTM-CRF framework based on PGN-Adapter to implement our model. Figure 3 shows an overview of our model.
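The BMESO conversion described above can be sketched as a small routine; the span format and function name here are illustrative, not the paper's actual implementation.

```python
# Minimal sketch of the BMESO conversion: each argument span becomes
# word-level boundary tags, and all remaining words are tagged "O".

def to_bmeso(n, spans):
    """n: sentence length; spans: list of (start, end, role) with
    inclusive 0-based indices, e.g. [(0, 0, "HOLDER"), (3, 5, "TARGET")]."""
    tags = ["O"] * n
    for b, e, role in spans:
        if b == e:
            tags[b] = f"S-{role}"          # single-word argument
        else:
            tags[b] = f"B-{role}"          # beginning word
            tags[e] = f"E-{role}"          # ending word
            for k in range(b + 1, e):
                tags[k] = f"M-{role}"      # middle words
    return tags
```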

PGN-Adapter
This model (Üstün et al., 2020) is based on the standard BERT architecture (Devlin et al., 2018). The input sentence w_1, ..., w_n is decomposed into the wordpiece sequence c_1, ..., c_m, and the input representations r_1, ..., r_m are obtained by summing each wordpiece embedding c_i and its position embedding p_i. Each r_i is then passed through stacked self-attention layers to produce hidden representations h_1, ..., h_m = SelfAttn(r_1, ..., r_m; θ_ada), where θ_ada denotes the adapter modules. Following Houlsby et al. (2019), two adapters, each with two feedforward projections and a GELU nonlinearity, are merged into each transformer layer, as shown in Figure 3.
To control the amount of sharing across languages, we generate the trainable parameters of the adapter modules with the PGN method. The adapter weights are generated as W_ada = PGN(I_e), where W_ada denotes the parameters of the adapter modules. The parameters of the BERT model are frozen except for the adapter part, so our model is much more parameter-efficient than fine-tuning; meanwhile, our preliminary results show that this does not hurt performance. I_e is a language embedding produced by a multi-layer perceptron MLP_lang, i.e., I_e = MLP_lang(I_t), where I_t is a typological feature vector from the URIEL language typology database (Littell et al., 2017). Following Üstün et al. (2020), we build our language embeddings from syntactic, phonological and phonetic inventory features with a k-nearest-neighbors approach.
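A minimal numeric sketch of the parameter generation step, assuming a single-layer MLP_lang and one flattened adapter matrix; the dimensions and variable names here are our own illustration, not the reference implementation.

```python
import numpy as np

# Sketch of contextual parameter generation (PGN): the trainable tensor
# `pgn` maps a language embedding of size d_lang to the flattened adapter
# weights, W_ada = pgn @ I_e. Dimensions are illustrative assumptions.

rng = np.random.default_rng(0)
d_model, d_ada, d_lang, d_typ = 768, 256, 32, 289

pgn = rng.normal(size=(d_model * d_ada, d_lang))    # trainable generator
I_t = rng.normal(size=d_typ)                        # URIEL typology vector
W_mlp = rng.normal(size=(d_lang, d_typ)) * 0.01     # MLP_lang (one layer here)

I_e = np.tanh(W_mlp @ I_t)                          # language embedding
W_ada = (pgn @ I_e).reshape(d_model, d_ada)         # generated adapter weights

print(W_ada.shape)  # (768, 256)
```

The point of this construction is that only `pgn`, `W_mlp` and the language embeddings are trained, while languages share the generator, so the per-language parameter cost stays small.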
Further, word-level representations x_1, ..., x_n are derived by average pooling over the covered word pieces of each word.
BiLSTM-CRF On top of these outputs, we apply bi-directional LSTMs (BiLSTM) and a conditional random field (CRF) (Lafferty et al., 2001) to obtain high-level features and compute the probability of each candidate output tag sequence y = y_1, ..., y_n. Concretely, SCORE(y) = Σ_i (W · h_i)[y_i] + Σ_i T[y_{i−1}, y_i] and p(y | sent, expr) = e^{SCORE(y)} / Σ_{y'=y'_1,...,y'_n} e^{SCORE(y')}, where y' ranges over all candidate outputs, and W and T are parameters. We use the Viterbi algorithm over SCORE(y) to search for the ORL tag sequence with the maximum score.
Training Objective We exploit a sentence-level cross-entropy loss for model training: L = − log p(g = g_1, ..., g_n | sent, expr) (5), where g = g_1, ..., g_n denotes the gold-standard tag sequence of a given sentence-expression pair.
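To make the objective in Eq. (5) concrete, the following sketch computes the CRF negative log-likelihood with the standard forward algorithm; shapes and names are illustrative assumptions, with `emit` standing in for the emission scores W · h from the BiLSTM.

```python
import numpy as np

# Sketch of the CRF objective: SCORE(y) sums emission and transition
# scores, and the loss is -log p(g), with the partition function over
# all tag sequences computed by the forward algorithm.

def crf_nll(emit, trans, gold):
    """emit: (n, K) emission scores; trans: (K, K) transitions;
    gold: length-n gold tag indices. Returns -log p(gold)."""
    n, K = emit.shape
    # Gold-path score SCORE(g).
    score = emit[0, gold[0]] + sum(
        trans[gold[t - 1], gold[t]] + emit[t, gold[t]] for t in range(1, n))
    # Forward algorithm: log-sum over all candidate tag sequences y'.
    alpha = emit[0].copy()
    for t in range(1, n):
        # alpha'[j] = logsumexp_i(alpha[i] + trans[i, j] + emit[t, j])
        m = alpha[:, None] + trans + emit[t][None, :]
        alpha = np.log(np.exp(m - m.max(0)).sum(0)) + m.max(0)
    log_Z = np.log(np.exp(alpha - alpha.max()).sum()) + alpha.max()
    return log_Z - score   # -log p(g)
```

At prediction time the same SCORE(y) is maximized with Viterbi instead of summed, as described above.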
Table 4: Statistics of the English and Chinese datasets.

Section | #sent | Holder | Target
English Train | 4846 | 2438 | 2533
English Dev | 2298 | 1196 | 1259
English Test | 1435 | 779 | 802
Chinese Train | 4846 | 2417 | 2457
Chinese Dev | 2298 | 1196 | 1259
Chinese Test | 1435 | 759 |

Model Transfer We use multilingual word representations to achieve cross-lingual model transfer. In particular, we use the pre-trained multilingual BERT-Base (cased version) 5 (Devlin et al., 2018). All pretrained parameters inside BERT are frozen during ORL training.

Datasets
English Dataset. We use the widely adopted ORL benchmark dataset, MPQA version 2.0 (Wiebe et al., 2005; Wilson, 2008), to evaluate our models. We focus on identifying expression-holder and expression-target relations with expressions given. We split the whole corpus into fixed training, development and testing sections.
Chinese Dataset. We construct the Chinese dataset as described in Section 3; basic statistics of the corpora are shown in Table 4. Note that the reduced number of auto-annotated opinion arguments is due to our removal of crossing labels. The Chinese dataset consists of three parts: 1) manually translated and manually word-aligned from the English dataset; 2) manually translated but automatically word-aligned (Train-en-half-auto); and 3) automatically translated and word-aligned (Train-en-auto). The data splits are directly mapped from the corresponding English divisions for a fair comparison.
Portuguese Dataset. We use the Portuguese ORL dataset released by Almeida et al. (2015b) for the cross-lingual transfer as well. The whole dataset is used to help Chinese ORL. 5 https://github.com/google-research/bert

Evaluation Metrics
As usual, we use precision (P), recall (R) and (exact) F1-score to evaluate our methods. Following Marasović and Frank (2017), we also exploit two softer evaluation metrics: binary and proportional overlap scores. In detail, binary F1 treats an entity as correct if it overlaps with the gold standard, and proportional F1 assigns a partial score proportional to the size of the overlapping region.
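The three variants can be sketched for a single role as follows, under our own span conventions (inclusive word indices); the exact definitions of the soft metrics vary slightly across papers, so this is one common instantiation rather than the paper's evaluation script.

```python
# Sketch of exact, binary and proportional span F1 for one role.
# Spans are inclusive (start, end) word-index pairs.

def overlap(a, b):
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)

def _f1(p, r):
    return 2 * p * r / (p + r) if p + r else 0.0

def span_f1(pred, gold):
    """Return (exact, binary, proportional) F1."""
    if not pred or not gold:
        return 0.0, 0.0, 0.0
    exact_p = sum(s in gold for s in pred) / len(pred)
    exact_r = sum(g in pred for g in gold) / len(gold)
    bin_p = sum(any(overlap(s, g) for g in gold) for s in pred) / len(pred)
    bin_r = sum(any(overlap(g, s) for s in pred) for g in gold) / len(gold)
    # Proportional: credit each span by its best overlap ratio.
    prop_p = sum(max(overlap(s, g) for g in gold) / (s[1] - s[0] + 1)
                 for s in pred) / len(pred)
    prop_r = sum(max(overlap(g, s) for s in pred) / (g[1] - g[0] + 1)
                 for g in gold) / len(gold)
    return _f1(exact_p, exact_r), _f1(bin_p, bin_r), _f1(prop_p, prop_r)
```

For example, predicting span (0, 2) against gold (0, 1) gives exact F1 of 0, binary F1 of 1, and a proportional F1 in between.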

Settings
Our models are implemented in PyTorch 1.4 and run on a V100 GPU (32 GB). The model contains several hyper-parameters. We set the output hidden size of the BiLSTM to 200 and the number of BiLSTM layers to 3. To prevent overfitting, we set the dropout rate to 0.33. For PGN-Adapter, we search the language embedding size in [16, 32, 50, 64] and the adapter size in [128, 256, 512], and set the language embedding dropout rate to 0.1.
We exploit online training to learn the model parameters, using the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.002; the adapter modules are trained with the AdamW optimizer (Loshchilov and Hutter, 2017) with a learning rate of 5e-6. The mini-batch size is set to 32, and the model parameters are updated every 4 mini-batches. We clip gradients to a max norm of 1.0. By default, we train for at most 40 epochs with early stopping, evaluating on the development and test sets every 160 steps. Finally, we select the model with the best development result.

Models
To better analyze the experiments, we select three models: BERT, Adapter, and PGN-Adapter.
BERT We use multilingual BERT-Cased as the baseline model.
Adapter The Adapter modules in BERT can capture language-specific information automatically. It also serves as a baseline to further verify the validity of the next model.

PGN-Adapter
Compared with Adapter, the PGN-Adapter model integrates the PGN method into the Adapter and incorporates richer language information from externally given language typology. To the best of our knowledge, this model is a strong candidate for cross-language tasks; we use it to better demonstrate the model transfer and translation-based cross-lingual transfer methods below. Note that if the source and target languages are the same, only the first model, BERT, is used.

Experiment Results and Analysis
This section provides an overview of the English-Chinese transfer experiments and of the multilingual transfer experiments that add Portuguese, with Chinese as the target language. We then analyze the results in detail. To understand the performance on the different roles (i.e., HOLDER and TARGET), we measure how performance varies with the span length of the arguments. For the cross-lingual analysis, apart from the MT-based setting, we also add one more semi-automatic setting, i.e., manual translation with automatic alignment.
To be clear in advance, the CorpusTrans method below uses only the fixed multilingual BERT embeddings, without the Adapter or PGN-Adapter methods, as its training, development and test datasets are all in Chinese.

English-Chinese Transfer
Our experiments mainly focus on Chinese as the target language. Table 5 shows the results of the English-Chinese transfer. We observe that 1) almost all experiments using the PGN-Adapter model achieve the best results within their group; 2) when manual translation and annotation are available, the model combining all the datasets (English, Chinese, and the automatic translation of the English corpus) performs best; 3) comparing model transfer and automatic translation, the latter outperforms the former by a large margin; and 4) combining the two approaches further improves performance, although it remains inferior to the manual translation model. In short, for English-to-Chinese transfer, the translation-based method is preferable to model transfer, even with machine-translated data.
Notice that the "auto" approaches can easily be adapted to other language pairs without large labor costs. Apart from Portuguese (Section 6.2), we will explore more low-resource target languages in the future.

Multilingual Transfer
In this section, we also conduct multilingual transfer experiments, thanks to the availability of the Portuguese dataset (Almeida et al., 2015a). For a fair comparison, we still focus on Chinese as the target language.
Table 6 displays the experimental results when adding both English and Portuguese. The first two training corpora (four lines) represent the model transfer and (automatic) translation-based methods with two source languages, respectively. The observations are similar to the bilingual settings. When we combine the two methods, only a few indicators improve. Comparing against Table 5, we find that model transfer benefits from the second source language, but for corpus translation, performance noticeably decreases due to the low quality of the machine-translated data: according to the MT community, the English-Chinese MT system achieves a 45+ BLEU score (Li et al., 2019), while Portuguese-Chinese MT reaches only around 20 (Liu et al., 2018). Figure 4 illustrates how performance changes with the span length of the arguments, for the holder (top) and the target (bottom). For the holder, the general tendency is downward: longer spans yield worse performance. It is worth pointing out, however, that the CorpusTrans method performs well on long holder spans, even better than the HumanTrans method, which is created by human translators. ModelTrans scores worse on long spans, but adding CorpusTrans yields performance similar to HumanTrans. That is to say, model transfer and the translation-based model together are very helpful for both long and short HOLDERs. For the target, the best performance is achieved on middle-length spans; note that the average target length is 4.8 words, longer than the 2.0-word average for holders. We speculate that short spans may not contain enough semantics, while the boundaries of longer spans are not trivial to recognize correctly. As for Combine, its score in the middle range is even higher than manual translation (HumanTrans), due to the mutual enhancement of the two methods.
HumanTrans denotes human translation, CorpusTrans corpus translation, and ModelTrans model transfer; Combine merges the three methods.

Influence of the Automatic Alignment
In addition to the fully automatic setting (machine translation with automatic alignment) and the fully manual setting (manual translation and alignment), we also explore manual translation with automatic alignment. Table 7 lists the results for comparison. We observe that the automatic alignment setting (with human translation) performs the worst among the three configurations. This may seem unexpected at first glance, since its translation quality is still much better than that of the machine-translated sentences. We speculate that human translation and machine translation behave quite differently: MT systems rely more on word alignment, while humans usually translate the sentence as a whole, so the automatic alignment fails to transfer the annotations from the source side to the target side.

Conclusions
We presented the first work on Chinese ORL. First, we manually constructed a Chinese dataset with the help of corpus translation. Then, we investigated unsupervised cross-lingual transfer for Chinese ORL, studying two different approaches: model transfer and corpus translation. Experiments and analyses were performed on the annotated dataset. The results showed that unsupervised cross-lingual transfer is an effective method for Chinese ORL, and that multi-source transfer further improves the results, which is promising for future exploration of such cross-lingual transfer to other low-resource languages.