Identifying Chinese Opinion Expressions with Extremely-Noisy Crowdsourcing Annotations

Recent work on opinion expression identification (OEI) relies heavily on the quality and scale of manually constructed training corpora, which can be extremely difficult to obtain. Crowdsourcing is one practical solution to this problem, aiming to create a large-scale corpus whose quality is not guaranteed. In this work, we investigate Chinese OEI with extremely-noisy crowdsourcing annotations, constructing a dataset at a very low cost. Following Zhang et al. (2021), we train the annotator-adapter model by regarding all annotations as gold-standard in terms of crowd annotators, and test the model by using a synthetic expert, which is a mixture of all annotators. As this annotator mixture for testing is never modeled explicitly in the training phase, we propose to generate synthetic training samples by a pertinent mixup strategy to make training and testing highly consistent. The simulation experiments on our constructed dataset show that crowdsourcing is highly promising for OEI, and that our proposed annotator-mixup can further enhance the crowdsourcing modeling.


Introduction
Opinion mining is a fundamental topic in the natural language processing (NLP) community, which has received great attention for decades (Liu and Zhang, 2012). Opinion expression identification (OEI) is a standard task of opinion mining, which aims to recognize the text spans that express particular opinions (Breck et al., 2007). Figure 1 shows two examples. This task has generally been solved by supervised learning (İrsoy and Cardie, 2014) with a well-established corpus annotated by experts. Almost all previous studies are based on English datasets such as MPQA. By carefully examining this task, we find that the corpus annotation of opinion expressions is by no means an easy process. It is highly ambiguous across different persons. As shown in Figure 1, it is very controversial to define the boundaries of opinion expressions. This problem is especially serious for languages such as Chinese, which is written in characters with no explicit and clearly defined word boundaries. Thus, Chinese-like languages inevitably involve more ambiguities.
To obtain a high-quality corpus, we usually need to train the annotators with great effort, making them acquainted with a specific fine-grained guideline drafted by experts, and only then start the data annotation under strict control. Finally, a further expert check on the borderline cases where the annotators disagree most is advisable to ensure the quality of the annotated corpus. The whole process is clearly quite expensive. Thus, crowdsourcing with no training (just a brief guideline) and no expert checking is more practical in real scenarios (Snow et al., 2008). On the other hand, the difficulty of the Chinese OEI task might lead to very low-quality crowdsourcing annotations.
In this work, we present the first study of Chinese OEI by using crowdsourcing. We manually construct an OEI dataset by crowdsourcing, which is used for training. The dataset is cheap but contains a great deal of noise according to our initial observation. We also collect a small-scale development and test corpus with expert annotations for evaluation.¹ Our dataset is constructed over a set of Chinese texts closely related to the COVID-19 topic. We then start our investigation with a strong BERT-BiLSTM-CRF model, treating the OEI task as a standard sequence labeling problem following previous studies (Breck et al., 2007; İrsoy and Cardie, 2014; Katiyar and Cardie, 2016). Our primary goal is to answer whether these extremely-noisy crowdsourcing annotations hold potential value for the OEI task.
To make the best use of our crowdsourcing corpus, we follow Zhang et al. (2021) in treating all crowd annotations as gold-standard in terms of different annotators. We introduce the annotator-adapter model, which employs the crowdsourcing learning approach of Zhang et al. (2021) in OEI for the first time. It jointly encodes both texts and annotators, then predicts the corresponding crowdsourcing annotations in the BERT-BiLSTM-CRF architecture. Concretely, we train the annotator-adapter model on each individual annotator and the corresponding annotations, then test the model by using a pseudo expert annotator, which is a linear mixture of crowd annotators. Considering that this expert is never modeled during training, we further exploit a simple mixup (Zhang et al., 2018) strategy to simulate the expert decoding accurately.
Experimental results show that crowdsourcing is highly competitive, giving an overall F1 score of 53.86 even with a large amount of noise, while the F1 score of the model trained on the expert corpus is 57.08. We believe that this performance gap is acceptable for building OEI application systems. In addition, our annotator-mixup strategy further boosts the performance of the annotator-adapter model, giving an F1 increase of 54.59 − 53.86 = 0.73. We conduct several analyses to understand OEI with crowdsourcing and our suggested methods comprehensively.
In summary, we make three major contributions in this work:
• We present the initial work investigating the OEI task with crowdsourcing annotations, showing its capability on Chinese.
• We construct a Chinese OEI dataset with crowdsourcing annotations, which is not only valuable for Chinese OEI but also instructive for crowdsourcing research.
• We introduce the annotator-adapter for crowdsourcing OEI and propose the annotator-mixup strategy, which can effectively improve the crowdsourcing modeling.
All of our code and the dataset will be available at github.com/izhx/crowd-OEI for research purposes.

Dataset
The outbreak of COVID-19 brings strong demand for building robust Chinese opinion mining systems, which are in practice built in a supervised manner. A large-scale training corpus is the key to system construction, while almost all existing related datasets are in English. Hence, we manually construct a Chinese OEI dataset by crowdsourcing. We focus on opinion expressions with positive or negative polarities only. The construction consists of four steps: (1) text collection, (2) annotator recruitment, (3) crowd annotation, and (4) expert checking and correction.

Text Collection
We choose Sina Weibo,² a Chinese social media platform similar to Twitter, as our data source. To collect texts strongly related to COVID-19, we select around 8k posts that were created from January to April 2020 and are related to seven hot topics (Table 1). To make these posts ready for annotation, we use HarvestText³ to clean them and segment the resulting texts into sentences. Next, we conduct another cleaning step to remove duplicates and sentences with relatively poor writing style (e.g., a high proportion of non-Chinese symbols, very short or very long length, etc.).
After the above procedure, there is still a large proportion of sentences that involve no sentiment. We therefore filter them out with a BERT sentiment classifier that is trained on an open-access Weibo sentiment classification dataset.⁴ Only sentences that are predicted with high confidence to express no sentiment are dropped.⁵ We can therefore keep the most valuable content while avoiding unnecessary annotations, and thus reduce the overall annotation cost.

Annotator Recruitment
We have five professionals who have previously engaged in the annotation of sentiment and opinion-related tasks, and whose rich experience qualifies them as experts. They annotate 100 sentences together as examples (i.e., label the positive and negative opinion expressions inside the texts), and establish a simple guideline based on their consensus after several discussions. The guideline includes the task definition and a description of the annotation principle.⁶ Next, we recruit 75 (crowd) students in our university for annotation. They come from different grades and different majors, such as Chinese, Literature, and Translation. We offer them the above annotation guideline to understand the task. We choose doccano⁷ to build our annotation platform, and let these annotators become familiar with our task through the expert-annotated examples.

Crowd Annotation
When all crowd workers are ready, we start the crowd annotation phase. The prepared texts are split into micro-tasks so that each one consists of 500 sentences. Then we assign 3 to 5 workers to each micro-task, and their identities remain hidden from each other. A worker cannot access a new task until the current one is finished.
In the annotation of each sentence, workers need to label the positive and negative opinion expressions according to the guideline and their understandings. The number of positive or negative expressions in one sentence has no limit. They can also mark a sentence as "No Opinion" and skip it if they think there are no opinion expressions inside.

Expert Checking and Correction
After all crowd annotations are accomplished, we randomly select a small proportion of sentences and let experts reannotate them, resulting in the gold-standard development and test corpus.⁸ Specifically, for each sentence, we let 2 experienced experts individually reannotate it with reference to the corresponding crowdsourcing annotations. They give the final annotation of each sentence if their answers reach an agreement. If they diverge, a third expert helps them modify their answers and reach an agreement. Then, we let all five experts go through the remaining dataset,⁹ selecting the best annotations for each sentence, which can be regarded as the silver-standard training corpus. In the selection, each sentence is assigned to 1 expert, and the expert is only allowed to choose one (or several identical) best answer(s) from all the candidate crowdsourcing annotations. Finally, only for comparison, we also annotated the gold-standard training corpus, which will not be used in our model training.
⁴ ChineseNlpCorpus - weibo_senti_100k
⁵ Note that there are still a small number of sentences in our final dataset that have no opinion expression inside.
⁶ We share the guideline in Appendix A.
⁷ https://github.com/doccano/doccano

Dataset Statistics
In the end, we arrive at 42,274 crowd annotations by 70 valid annotators,¹⁰ covering 10,367 sentences. A total of 803 + 1517 = 2320 sentences, which include expert annotations, are used for development and test evaluations. Table 2 shows the overall data statistics. The average number of annotators per sentence is 4.05, and each annotator labels an average of 827 sentences in the whole corpus. The overall Cohen's Kappa value of the crowd annotations is 0.35. When ignoring the characters that no annotator considers to be part of any expression, the Kappa is only 0.17.¹¹ The Kappa values are indeed very low, indicating the great and unavoidable ambiguities of the task with natural annotations.¹² However, these values do not make much sense, since we do not impose any well-designed comprehensive guideline during annotation. In fact, a comprehensive guideline for crowd workers is almost impracticable in our task, because they quite often disagree with a particular guideline owing to their own unique and naive understandings. If we imposed such a guideline on them forcibly, the annotation cost would increase drastically (i.e., at least ten times more expensive according to our preliminary investigation) because of their reluctance as well as endless expert guidance. In the remainder of this work, we try to verify the real value of these crowdsourcing annotations empirically: Is the collected training corpus really beneficial for our Chinese OEI task?
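The token-level agreement computation described in the footnote on Kappa can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how more than two annotators are combined, so the sketch handles a single pair of annotators, and the function names `cohen_kappa` and `mean_sentence_kappa` are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa between two equally long token-label sequences."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    # Observed agreement: fraction of tokens with identical labels.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's own label distribution.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:  # both annotators used one identical label throughout
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


def mean_sentence_kappa(pairs: List[Tuple[Sequence[str], Sequence[str]]]) -> float:
    """Average the per-sentence kappa values (assumed aggregation)."""
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)
```

Treating each token as an instance means that long runs of agreed "O" labels raise the observed agreement, which is one plausible reason the value drops when such characters are ignored.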

Methodology
The OEI task aims to extract all polarized text spans that express certain opinions in a sentence. It can be naturally converted into a sequence labeling problem by using the BIO schema, tagging each token with the boundary information of opinion expressions, where "B-X" and "I-X" (i.e., "X" can be either "POS" or "NEG", denoting the polarity) indicate the start and the other positions of a certain expression, and "O" denotes a token that does not belong to any expression. In this work, we adapt the CRF-based system (Breck et al., 2007) to the neural setting and enhance it with a BiLSTM encoder as well as pre-trained BERT representations.
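The BIO conversion described above can be illustrated with a short sketch. The span format `(start, end_exclusive, polarity)` and the function names are assumptions made for illustration, not part of the released code.

```python
from typing import List, Optional, Tuple

# A span is (start, end_exclusive, polarity); polarity is "POS" or "NEG".
Span = Tuple[int, int, str]


def spans_to_bio(n_tokens: int, spans: List[Span]) -> List[str]:
    """Convert non-overlapping opinion spans into BIO tags."""
    tags = ["O"] * n_tokens
    for start, end, polarity in spans:
        tags[start] = f"B-{polarity}"
        for i in range(start + 1, end):
            tags[i] = f"I-{polarity}"
    return tags


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Recover (start, end_exclusive, polarity) spans from BIO tags."""
    spans: List[Span] = []
    start: Optional[int] = None
    polarity: Optional[str] = None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and tag[2:] == polarity
        if start is not None and not continues:
            spans.append((start, i, polarity))
            start, polarity = None, None
        if tag.startswith("B-"):
            start, polarity = i, tag[2:]
    return spans
```

For Chinese text, each token would typically be a single character, so the indices above are character offsets.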

BERT-BiLSTM-CRF Baseline
Given a sentence $x = x_1 \cdots x_n$ (where $n$ denotes the sentence length), we first convert it into contextual representations $r_1 \cdots r_n$ by the pre-trained BERT with adapter tuning (Houlsby et al., 2019):

$$ r_1 \cdots r_n = \text{ADBERT}(x_1 \cdots x_n). \tag{1} $$

Unlike the standard BERT exploration, ADBERT introduces two extra adapter modules inside each transformer layer, as shown in Figure 2 for the details. With this modification, we do not need to fine-tune all BERT parameters; instead, learning the parameters of the adapters is enough to obtain a strong performance. Thus ADBERT is more parameter efficient. The standard adapter layer can be formalized as:

$$ h_{mid} = \text{GELU}(W_{down} h_{in} + b_{down}), \qquad h_{out} = W_{up} h_{mid} + b_{up} + h_{in}, \tag{2} $$

where $W_{down}$, $W_{up}$, $b_{down}$ and $b_{up}$ are model parameters, which are much smaller than the parameters of the transformer in scale, and the dimension size of $h_{mid}$ is also smaller than that of the corresponding transformer dimension.¹³
The rest of the baseline is a standard BiLSTM-CRF model, which is a stack of BiLSTM, MLP and CRF layers, from which we can obtain sequence-level scores for each candidate output $y$:

$$ \text{score}(y) = \text{CRF}\big(\text{MLP}(\text{BiLSTM}(r_1 \cdots r_n)),\, y\big), \qquad p(y) = \frac{\exp(\text{score}(y))}{\sum_{\tilde{y} \in Y} \exp(\text{score}(\tilde{y}))}, \tag{3} $$

where $p(y)$ is the probability of the given ground-truth, and $Y$ is all possible outputs for score normalization. The model parameters are updated by the sentence-level cross-entropy loss $L = -\log p(y^*)$ when $y^*$ is regarded as gold-standard.
Figure 2: During the adapter tuning, green layers are trainable, including the adapters, the LayerNorm, and other task-specific modules.
¹¹ To compute the Kappa value of sequential annotations, we treat each token (not sentence) as an instance, and then aggregate the results of one sentence by averaging.
¹² The average F1 score of each annotator against the expert is 41.77%, which is significantly lower than the 60%+ of the crowdsourcing NER dataset (Rodrigues et al., 2014b).
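The sentence-level normalization over all possible outputs $Y$ can be made concrete with a brute-force sketch. It assumes a standard linear-chain decomposition of the sequence score into per-token emission scores and label-transition scores, which is not spelled out in the text, and it enumerates $Y$ explicitly, which is only feasible for toy inputs (practical CRF layers use the forward algorithm instead). All names are hypothetical.

```python
import itertools
import math
from typing import Dict, List, Tuple

Emissions = List[Dict[str, float]]          # one {label: score} dict per token
Transitions = Dict[Tuple[str, str], float]  # (previous label, label) -> score


def sequence_score(emissions: Emissions, transitions: Transitions, labels) -> float:
    """Sum of emission scores plus transition scores along one label path."""
    score = sum(emissions[i][label] for i, label in enumerate(labels))
    score += sum(transitions.get(pair, 0.0) for pair in zip(labels, labels[1:]))
    return score


def sequence_log_prob(emissions: Emissions, transitions: Transitions,
                      labels, label_set) -> float:
    """log p(y): the score of y normalized over every candidate sequence."""
    candidates = itertools.product(label_set, repeat=len(emissions))
    scores = [sequence_score(emissions, transitions, c) for c in candidates]
    top = max(scores)  # subtract the maximum for a stable log-sum-exp
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    return sequence_score(emissions, transitions, labels) - log_z
```

Under these assumptions, the training loss for a gold sequence `y_star` is simply `-sequence_log_prob(emissions, transitions, y_star, label_set)`.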
Crowdsourcing training. In the crowdsourcing setting, we only have annotations from multiple non-expert annotators, so no gold-standard label is available for training. To handle this situation, we introduce two straightforward and widely used methods. First, we treat all annotations uniformly as training instances, even though they may introduce noise into our training objective; this method is denoted by ALL for short. Second, we exploit majority voting¹⁴ to obtain an aggregated answer for each sentence for model training, denoted as MV.
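As a rough illustration of the MV baseline, the sketch below votes independently on each token. The exact voting scheme used in the experiments is not given in this text, so token-level voting, the tie-breaking rule, and the name `majority_vote` are all assumptions.

```python
from collections import Counter
from typing import List


def majority_vote(annotations: List[List[str]]) -> List[str]:
    """Token-level majority vote over the BIO sequences of one sentence."""
    n = len(annotations[0])
    assert all(len(seq) == n for seq in annotations)
    voted = []
    for i in range(n):
        counts = Counter(seq[i] for seq in annotations)
        # Among tied labels, most_common keeps the first one encountered.
        voted.append(counts.most_common(1)[0][0])
    return voted
```

Independent per-token voting can yield sequences that are not valid BIO (for example an "I-POS" with no preceding "B-POS"), so a real pipeline would likely need a repair step afterwards.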

Annotator Adapter
In most previous crowdsourcing studies, there is a common agreement that crowd annotations are noisy and should be rectified during training (Rodrigues et al., 2014a; Nguyen et al., 2017; Simpson and Gurevych, 2019). Zhang et al. (2021) propose to regard all crowdsourcing annotations as gold-standard, and introduce a representation learning model that jointly encodes the sentence and the annotator and extracts annotator-aware features, which model the unique understandings of annotators (this setting is indeed very consistent with our corpus). Since our constructed dataset has no gold-standard training labels,¹⁵ we adopt their unsupervised representation learning approach, which is named annotator-adapter. It applies the Parameter Generator Network (PGN) (Platanios et al., 2018; Jia et al., 2019; Üstün et al., 2020) to generate annotator-specific adapter parameters for the ADBERT, as shown in Figure 3. Given an input sentence-annotator pair $(x = x_1, \ldots, x_n, a)$, we exploit an embedding layer to convert the annotator ID $a$ into its vectorial form $e_a$, and then the PGN is used to generate the model parameters of several high-level adapter layers inside BERT conditioned on $e_a$. Concretely, we apply the PGN to the last $p$ layers of BERT, where $p$ is a hyper-parameter of our model. We refer to the updated input representation module as PGN-ADBERT.
Formally, for an adapter defined by Equation 2, all its parameters are dynamically generated by:

$$ W_{down} = T_{W_{down}} \times e_a, \quad b_{down} = T_{b_{down}} \times e_a, \quad W_{up} = T_{W_{up}} \times e_a, \quad b_{up} = T_{b_{up}} \times e_a, \tag{4} $$

where $T_{W_{down}}$, $T_{b_{down}}$, $T_{W_{up}}$ and $T_{b_{up}}$ are learnable model parameters for the PGN-ADBERT. For any matrix-format model parameter $W \in \mathbb{R}^{M \times N}$, we have $T_W \in \mathbb{R}^{M \times N \times d}$, where $d$ is the dimension of the annotator embedding. Similarly, for the vectorial parameter $b \in \mathbb{R}^{N}$, we have $T_b \in \mathbb{R}^{N \times d}$. Thus, the overall input representation of the annotator-adapter can be rewritten as:

$$ r_1 \cdots r_n = \text{PGN-ADBERT}(x_1 \cdots x_n, e_a), \tag{5} $$

which jointly encodes the text and the annotator.
Figure 3: The annotator-adapter model. Given a joint input of the text $x_1 \cdots x_n$ and the annotator ID $a$, we first convert $a$ to its embedding $e_a$. Then, the PGN uses $e_a$ to generate annotator-specific parameters for the adapters in the top $p$ BERT layers (i.e., from $L_n$ to $L_{n-p+1}$) to compute annotator-aware input representations. Finally, the BiLSTM encodes the representations into high-level features and the CRF decoder predicts the labels $y^a_1 \cdots y^a_n$ that $a$ gives to $x_1 \cdots x_n$.
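The parameter generation step amounts to contracting each learnable tensor with the annotator embedding along its last axis. A minimal sketch with plain nested lists, assuming the shapes stated above ($T_W$ of size $M \times N \times d$ and $T_b$ of size $N \times d$); the function names are hypothetical and a real implementation would use a tensor library.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]
Tensor3 = List[List[List[float]]]


def generate_matrix(t_w: Tensor3, e_a: Vector) -> Matrix:
    """Contract T_W (M x N x d) with e_a (d) to obtain W (M x N)."""
    return [[sum(t * e for t, e in zip(cell, e_a)) for cell in row] for row in t_w]


def generate_bias(t_b: Matrix, e_a: Vector) -> Vector:
    """Contract T_b (N x d) with e_a (d) to obtain b (N)."""
    return [sum(t * e for t, e in zip(row, e_a)) for row in t_b]
```

Because the generated parameters are linear in `e_a`, any linear mixture of annotator embeddings yields the same mixture of the annotators' adapter parameters.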
At the training stage, the model uses the embeddings of crowd annotators to generate crowd model parameters and learn crowd annotations. At the inference stage, it uses the centroid of all annotator embeddings to estimate the expert, predicting high-quality opinion expressions for raw texts. This expert embedding can be computed directly by:

$$ e_{expert} = \frac{1}{|A|} \sum_{a \in A} e_a, \tag{6} $$

where $A$ represents all annotators.
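The centroid used as the pseudo expert is just the element-wise mean of the learned annotator embeddings. A minimal sketch, with `expert_embedding` as a hypothetical name:

```python
from typing import List


def expert_embedding(annotator_embeddings: List[List[float]]) -> List[float]:
    """Element-wise mean of all annotator embeddings (the centroid)."""
    n = len(annotator_embeddings)
    d = len(annotator_embeddings[0])
    assert all(len(e) == d for e in annotator_embeddings)
    return [sum(e[k] for e in annotator_embeddings) / n for k in range(d)]
```

At inference time this vector would take the place of an individual annotator's embedding as the input to the PGN.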

Annotator Mixup
By scrutinizing the annotator-adapter model, we can find a minor mismatch between model training and testing. During training, the input annotators are all encoded individually, whereas during testing, the input expert is a mixture of the crowd annotators, which is never modeled. To tackle this divergence, we introduce the mixup (Zhang et al., 2018) strategy over the individual annotators to generate a number of synthetic samples with linear mixtures of annotators, making training and testing highly similar. The mixup strategy is essentially an effective data augmentation method that has received increasing attention recently in the NLP community (Zhang et al., 2020; Sun et al., 2020). The method is originally applied between two individual training instances, by using linear interpolation over a hidden input layer and the output. In this work, we confine the mixup to pairs of training instances that share the same input sentence, which gives our annotator mixup.
Formally, given two training instances $(x_1 \circ a_1, y_1)$ and $(x_2 \circ a_2, y_2)$, the mixup is executed only when $x_1 = x_2$; thus the interpolation is actually performed between $(a_1, y_1)$ and $(a_2, y_2)$. Concretely, the input interpolation is conducted at the embedding layer, and the outputs are mixed directly at the sentence level:

$$ e_{mix} = \lambda e_{a_1} + (1 - \lambda) e_{a_2}, \qquad y_{mix} = \lambda y_1 + (1 - \lambda) y_2, \tag{7} $$

where $\lambda \in [0, 1]$ is a hyper-parameter which is usually sampled from the $\text{Beta}(\alpha, \alpha)$ distribution, and $y_*$ is the one-hot vectorial form, where $* \in \{1, 2, mix\}$.¹⁶ Finally, the loss objective of the new instance is calculated by:

$$ L_{mix} = -\lambda \log \frac{\exp(\text{score}(y_1))}{\sum_{\tilde{y} \in Y} \exp(\text{score}(\tilde{y}))} - (1 - \lambda) \log \frac{\exp(\text{score}(y_2))}{\sum_{\tilde{y} \in Y} \exp(\text{score}(\tilde{y}))}, \tag{8} $$

where all scores are computed based on $x_1 / x_2$ and $e_{mix}$, and $Y$ is all possible outputs for $x_1 / x_2$. In this way, we can produce a number of augmented instances by the annotator mixup. These instances, together with the original training instances, are used to optimize our model parameters. The enhanced model is able to perform inference more robustly by using the mixture (i.e., average) of annotators, which is the estimation of the expert.
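The mixup step for two annotations of the same sentence can be sketched as below. This is a simplified illustration: it assumes the two log-probabilities have already been computed by the model with `e_mix` as the annotator input, and the names `mix_annotators` and `mixup_loss` are hypothetical.

```python
import random
from typing import List, Optional, Tuple


def mix_annotators(e1: List[float], e2: List[float], alpha: float = 0.5,
                   rng: Optional[random.Random] = None) -> Tuple[float, List[float]]:
    """Sample lambda from Beta(alpha, alpha) and interpolate two embeddings."""
    rng = rng or random.Random()
    lam = rng.betavariate(alpha, alpha)
    e_mix = [lam * u + (1.0 - lam) * v for u, v in zip(e1, e2)]
    return lam, e_mix


def mixup_loss(log_p_y1: float, log_p_y2: float, lam: float) -> float:
    """Cross-entropy against the soft target lam * y1 + (1 - lam) * y2.

    Both log-probabilities must come from the model run with e_mix.
    """
    return -(lam * log_p_y1 + (1.0 - lam) * log_p_y2)
```

With `alpha = 0.5`, the Beta distribution is U-shaped, so most sampled mixtures stay close to one of the two original annotators.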

Setting
Evaluation. We use the span-level precision (P), recall (R) and their F1 for evaluation, since OEI is essentially a span recognition task. Following Breck et al. (2007) and İrsoy and Cardie (2014), we exploit three types of metrics, namely exact matching, proportional matching and binary matching. The exact metric is straightforward and has been widely applied for span-level entity recognition tasks; it regards a predicted opinion expression as correct only when its start and end boundaries and its polarity are all correct. Here we exploit the exact metric as the major method. The two other metrics are exploited because it is very difficult to agree on exact boundaries, even for experts. The binary method treats an expression as correct when it overlaps with the ground-truth expression, and the proportional method uses a balanced score based on the proportion of the overlapped area with respect to the ground-truth.
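The three matching schemes can be sketched for a single sentence as follows. This is one common reading of the metrics and not necessarily the exact evaluation script: in particular, crediting each span with its best-overlapping counterpart and normalizing the proportional score by the length of the span being credited are assumptions, as are all names.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, polarity)


def overlap(a: Span, b: Span) -> int:
    """Number of shared tokens; zero if the polarities differ."""
    if a[2] != b[2]:
        return 0
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def credit(span: Span, others: List[Span], mode: str) -> float:
    """Best credit that `span` earns against any span in `others`."""
    best = 0.0
    for other in others:
        shared = overlap(span, other)
        if mode == "exact":
            score = 1.0 if span == other else 0.0
        elif mode == "binary":
            score = 1.0 if shared > 0 else 0.0
        elif mode == "proportional":
            score = shared / (span[1] - span[0])
        else:
            raise ValueError(f"unknown mode: {mode}")
        best = max(best, score)
    return best


def prf(pred: List[Span], gold: List[Span], mode: str = "exact"):
    """Span-level precision, recall and F1 for one sentence."""
    p = sum(credit(s, gold, mode) for s in pred) / len(pred) if pred else 0.0
    r = sum(credit(s, pred, mode) for s in gold) / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f1
```

For a gold span covering tokens 0 to 3 and a prediction covering tokens 2 to 3 with the same polarity, exact matching gives no credit, binary matching gives full credit, and proportional matching gives full precision but half recall under these assumptions.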
We use the best-performing model on the development corpus to evaluate the performance on the test corpus. All experiments are conducted on a single RTX 2080 Ti card of an 8-GPU server with a 14-core CPU and 128GB memory. We run each setting 5 times with different random seeds, and report the median evaluation scores.
Hyper-parameters. We exploit bert-base-chinese for input representations.¹⁷ The adapter bottleneck size and the BiLSTM hidden size are set to 128 and 400, respectively. For the annotator-adapter, we set the annotator embedding size d = 8 and generate the adapter parameters for the last p = 6 BERT layers. For the annotator mixup, we set α of the Beta(α, α) distribution to 0.5.
We apply sequential dropout to the input representations, which randomly sets the hidden vectors in the sequence to zeros with a probability of 0.2, to avoid overfitting. We use the Adam algorithm to optimize the parameters with a constant learning rate of 1 × 10⁻³ and a batch size of 64, and apply gradient clipping with a maximum value of 5.0 to avoid gradient explosion.
Baselines. Two annotator-agnostic baselines (i.e., ALL and MV) and the silver-corpus trained model Silver are all implemented with the same baseline structure and hyper-parameters. We also implement two annotator-aware methods presented in Nguyen et al. (2017), where the annotator-dependent noises are modeled explicitly. The LSTM-Crowd model encodes the output label bias (i.e., noises) of each individual annotator (biased distributions) towards the expert (zeroed distribution), and the LSTM-Crowd-cat model applies a similar idea but implements it at the BiLSTM hidden layer. During testing, zero vectors are exploited to simulate the expert accordingly. Their main idea is to achieve robust training on the noisy dataset, which is totally different from our approach. In addition, we aggregate the crowd labels of the training corpus by a Bayesian inference method (Simpson and Gurevych, 2019), namely BSC-seq, based on their code,¹⁸ and then evaluate its results with the same BERT-BiLSTM-CRF architecture.

Main Results
Table 3 shows the test results on our dataset. In general, the exact matching scores are all at a relatively low level, demonstrating that precise opinion boundaries are indeed difficult to identify. With the gradual relaxation of the metrics (from exact to binary), the scores increase accordingly, showing that these models can roughly locate the opinion expressions to a certain degree.
Dataset comparison. Similar to tasks like NER (Zhou et al., 2021), POS tagging, dependency parsing (Straka, 2018) and so on, in which English models have performed better than Chinese ones, we see the same pattern in our OEI task. The exact matching F1 of 57.08 achieved by the model trained on the Gold corpus still shows a performance gap compared with that on the English MPQA dataset (i.e., 63.71 by a similar BERT-based model of Xia et al. (2021)). This may be due to (1) the opinion boundaries in the word-based English MPQA being easier to locate than those in our character-based Chinese dataset, and (2) the social media domain of our dataset being more difficult than the news domain of MPQA.
¹⁸ https://github.com/UKPLab/arxiv2018-bayesianensembles
Method comparison. First, we compare the two annotator-agnostic methods (i.e., ALL and MV) with the annotator-aware ones (i.e., the remaining models). As shown in Table 3, annotator-aware modeling is effective as a whole, bringing better performance on exact matching. In particular, our basic annotator-adapter model gives the best F1 among the selected baselines, demonstrating its advantage in crowdsourcing modeling. When the annotator-mixup is applied, the test scores are further boosted, showing the effectiveness of our annotator mixup. The overall tendencies on the two other metrics are similar when comparing our models with the others.
Our final performance is not only comparable to that of the silver-corpus trained model, which we can take as a weak upper bound, but also close to that of the upper-bound model trained with expert annotations (i.e., Gold). Thus, our result for Chinese OEI is acceptable, demonstrating that crowdsourcing annotations are indeed of great value for model training. This observation indicates that crowdsourcing could be a highly promising alternative for building a Chinese OEI system at a low cost.

Analysis
Here we conduct fine-grained analyses to better understand the task and these methods in depth, where the evaluation by exact matching is used throughout this subsection. Several additional analyses are shown in the Appendix.
Performance by the opinion expression length. Intuitively, the identification of opinion expressions can be greatly affected by the length of the expressions, and longer expressions might be more challenging to identify precisely. Figure 4 shows the F1 scores in terms of expression lengths for the four models we focus on. We can see that the F1 score decreases dramatically when the expression length becomes larger than 4, which is consistent with our intuition. In addition, the annotator-adapter model is better than previous methods, and the mixup model reaches the best performance in almost all the categories, indicating the robustness of our annotator mixup.
Influence of the opinion number per sentence. One sentence may have more than one opinion expression, where these opinions might be mutually helpful or bring increased ambiguities. It is interesting to study the model behaviors in terms of opinion numbers. Here we conduct experimental comparisons by dividing the test corpus into three categories: (1) only one opinion expression exists in a sentence; (2) at least two opinions exist, and they are of the same sentiment polarity; (3) both positive and negative opinion expressions exist. As shown in Figure 5, the sentences with multiple opinions of a consistent polarity obtain the highest F1 score. The potential reason might be that the expressed opinions of these sentences are usually highly affirmative with strong sentiments, and the consistent expressions can be mutually helpful according to our assumption. The other two categories seem to be equally difficult according to the final scores. For all three categories, the two annotator-adapter models demonstrate better performance than the others.
[...] to predict opinion expressions and evaluate performance on the gold-standard annotations of experts. It is interesting to examine the self-evaluation performance on the crowd annotations of the test corpus as well. During inference, we use the crowd annotators as inputs, and calculate the model performance on the corresponding crowd annotations. Table 4 shows the results. First, the two annotator-agnostic models (i.e., ALL and MV) have similarly poor performance, since they try to estimate the expert annotation function rather than learn crowd annotations. Second, the performance of the two annotator-noise-modeling methods, LSTM-Crowd and LSTM-Crowd-cat, is close to that of the annotator-agnostic ones, showing that they are also incapable of modeling individual annotators.
Then, our two annotator-adapter models achieve leading performance compared with all baseline methods, giving a significant gap (at least 47.79 − 41.97 = 5.82 in F1). They are more capable of predicting crowd annotations, demonstrating the ability to model the annotators effectively. To our surprise, the mixup annotator-adapter model does not exceed the basic one, indicating that the mixed annotator embeddings in training could slightly hurt the modeling of individual annotators.

Related Work
OEI is one important task in opinion mining (Liu, 2012), and has received great interest (Breck et al., 2007; İrsoy and Cardie, 2014; Xia et al., 2021). The early studies can be dated back to Breck et al. (2007), which exploit CRF-based methods for the task with manually crafted features. SemiCRF is exploited next in order to use span-based features (Yang and Cardie, 2012). Recently, neural network models have attracted the most attention. İrsoy and Cardie (2014) present a deep bi-directional recurrent neural network (RNN) to identify opinion expressions. BiLSTM is also used in Katiyar and Cardie (2016), showing improved performance on OEI. Fan et al. (2019) design an Inward-LSTM to incorporate the opinion target information for identifying opinion expressions given their target, which can be seen as a special case of our task. Xia et al. (2021) employ pre-trained BERT representations (Devlin et al., 2019) to improve the joint extraction of the opinion expression, holder and target by a span-based model.
All the above studies are in English and based on MPQA or customer reviews (Wang et al., 2016, 2017; Fan et al., 2019), since there are very few datasets available for other languages. Hence, we construct a large-scale Chinese corpus for this task by crowdsourcing, and borrow a novel representation learning model (Zhang et al., 2021) to handle the crowdsourcing annotations. In this work, we take the general BERT-BiLSTM-CRF architecture as the baseline, which is a competitive model for the OEI task.
Crowdsourcing, as a cheap way to collect a large-scale training corpus for supervised models, has become increasingly popular in practice (Snow et al., 2008; Callison-Burch and Dredze, 2010; Trautmann et al., 2020). A number of models have been developed to aggregate a higher-quality corpus from the crowdsourcing corpus (Raykar et al., 2010; Rodrigues et al., 2014a,b; Moreno et al., 2015), aiming to reduce the gap to the expert-annotated corpus. Recently, modeling the bias between the crowd annotators and the oracle experts has been demonstrated to be effective (Nguyen et al., 2017; Simpson and Gurevych, 2019; Li et al., 2020); these methods focus on the label bias between the crowdsourcing annotations and gold-standard answers, regarding crowdsourcing annotations as annotator-sensitive noises. Zhang et al. (2021) do not treat crowdsourcing annotations as noisy labels, but instead regard them as ground-truths according to the understanding of individual crowd annotators. In this work, we follow the idea of Zhang et al. (2021) to exploit our crowdsourcing corpus, and further propose the annotator mixup to enhance the learning of the expert representation for the test stage.

Conclusion
We presented the first work on Chinese OEI by crowdsourcing, which is also the first crowdsourcing work on OEI. First, we constructed an extremely-noisy crowdsourcing corpus at a very low cost, and also built a gold-standard dataset annotated by experts for experimental evaluation. To verify the value of our low-cost and extremely-noisy corpus, we exploited the annotator-adapter model presented by Zhang et al. (2021) to fully explore the crowdsourcing annotations, and further proposed an annotator-mixup strategy to enhance the model. Experimental results show that the annotator-adapter makes the best use of our crowdsourcing corpus compared with several representative baselines, and that the annotator-mixup strategy is also effective. Our final performance reaches an F-score of 54.59% by exact matching. This number is highly competitive with respect to the model trained on expert annotations (57.08%), which indicates that crowdsourcing can be recommended for setting up a Chinese OEI system quickly and cheaply, although the collected corpus is extremely noisy.

Ethical/Broader Impact
We construct a large-scale Chinese opinion expression identification dataset with crowd annotations. We access the original posts by manually traversing the relevant Weibo topics or searching for the corresponding keywords, and then copy and anonymize the text contents. All posts we collected are open-access. In addition, we anonymize all annotators and experts, keeping only their IDs for research purposes. All annotators were properly paid according to their actual efforts. This dataset can be used both for the Chinese opinion expression identification task and for crowdsourcing sequence labeling.

A Annotation Guideline
In this annotation task, we will give a number of sentences that have a high probability of expressing positive or negative sentiment, and your goal is to label the words that express these sentiments in each sentence. An intuitive criterion for determining whether words express sentiment is that, if these words are replaced, the sentiment expressed by the sentence will also change. Sentiment words are usually not names of people, places, times, pronouns, etc. It is important to note that (1) you need to carefully understand the emotion expressed by the sentence, rather than judge it according to your own values, and (2) the labeled words usually do not include the target of the sentiment, such as pronouns or names of people, since replacing such words generally does not affect the sentiment.

B Hyper-parameter Tuning
We also implement the baseline models in the fine-tuning style. The results (Table 5) show that the adapter-based models achieve comparable performance while being parameter-efficient.

PGN Adapter Layers
First, we examine the influence of the PGN adapter layers mentioned in §3.2, controlled by p, which is a hyper-parameter of our annotator-adapter. As shown in Table 5, the performance is stable for p ∈ {6, 8, 10}. Considering both the parameter scale and the capability of our model, we set p = 6 as a trade-off.

Annotator Mixup
The mixup includes a hyper-parameter α that controls the interpolation through the distribution Beta(α, α). Here we show the influence of α by setting it to 0.2, 0.5, 0.8, and 1.0. We find no significant differences in model performance between these values, as shown in Table 5. To train our mixup model, we also use a simple trick: training the model in two stages. First, the model is trained only on the original corpus. Once the model achieves its best performance on the development set, we begin the second-stage training with both the original corpus and the augmented corpus. The performance difference between one-stage and two-stage training is shown in Table 5, which indicates that the two-stage training is important for our mixup model.
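The interpolation controlled by α can be illustrated with a minimal sketch. It assumes, for illustration only, that a synthetic sample is built by interpolating the embeddings of two annotators who labeled the same sentence, together with their per-token label distributions; the function name `annotator_mixup` and the list-based representation are hypothetical and not taken from our implementation.

```python
import random


def annotator_mixup(emb_a, emb_b, labels_a, labels_b, alpha=0.5, rng=None):
    """Mix two annotators' views of one sentence (illustrative sketch).

    emb_a, emb_b: annotator embeddings as equal-length lists of floats.
    labels_a, labels_b: per-token label distributions (e.g. one-hot lists).
    alpha: shape parameter of the symmetric Beta(alpha, alpha) distribution.
    """
    rng = rng or random
    # Sample the interpolation weight lambda from Beta(alpha, alpha).
    lam = rng.betavariate(alpha, alpha)
    # Interpolate the annotator embeddings.
    mixed_emb = [lam * a + (1.0 - lam) * b for a, b in zip(emb_a, emb_b)]
    # Interpolate the label distributions token by token (assumed soft targets).
    mixed_labels = [
        [lam * pa + (1.0 - lam) * pb for pa, pb in zip(tok_a, tok_b)]
        for tok_a, tok_b in zip(labels_a, labels_b)
    ]
    return lam, mixed_emb, mixed_labels


if __name__ == "__main__":
    rng = random.Random(0)
    lam, emb, labels = annotator_mixup(
        [1.0, 0.0], [0.0, 1.0],
        [[1.0, 0.0], [0.0, 1.0]],
        [[0.0, 1.0], [0.0, 1.0]],
        alpha=0.5, rng=rng,
    )
    print(lam, emb, labels)
```

With α < 1, Beta(α, α) places most of its mass near 0 and 1, so a mixed sample tends to stay close to one of the two annotators; with α = 1.0 the weight is uniform on [0, 1]. In a two-stage schedule such as the one described above, samples of this kind would only be added to the training data in the second stage.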

C Expert-Evaluation of Crowd Annotators
We evaluate each learned annotator of the three annotator-aware models from the expert's point of view. To do so, we feed an individual annotator embedding to the model to obtain the output predicted for that specific annotator, and then measure the performance of this output on the gold-standard test corpus. Table 7 shows the results. There is a huge discrepancy between the scores of different learned annotators of LSTM-Crowd or the annotator-adapter, demonstrating that annotators differ in their ability to predict gold labels. This is mainly because the annotators have different abilities, and the annotations they provide are therefore of different quality. No annotator in the annotator-adapter model outperforms the expert (centroid point), verifying that the estimated expert is strong and reasonable. In addition, the learned annotators of our mixup model perform more similarly to each other, since the annotator-mixup changes the learning objective from modeling annotators to modeling the expert, which can further boost the performance of the estimated expert.
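The two ingredients of this evaluation can be sketched as follows, under stated assumptions: the expert is taken as the plain centroid of the annotator embeddings, and an opinion expression is represented as a hypothetical (start, end) span tuple that counts as correct only if it matches a gold span exactly. The function names are illustrative and not from our code.

```python
def centroid(embeddings):
    """Average a list of equal-length annotator embeddings (assumed expert point)."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]


def exact_match_f1(gold_spans, pred_spans):
    """Micro-averaged F1 over sentences; a span is correct only on exact match.

    gold_spans, pred_spans: one list of (start, end) tuples per sentence.
    """
    true_pos = n_gold = n_pred = 0
    for gold, pred in zip(gold_spans, pred_spans):
        gold, pred = set(gold), set(pred)
        true_pos += len(gold & pred)
        n_gold += len(gold)
        n_pred += len(pred)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


if __name__ == "__main__":
    expert = centroid([[0.0, 2.0], [2.0, 4.0]])
    gold = [[(0, 2), (5, 7)], [(1, 3)]]
    pred = [[(0, 2), (5, 8)], [(1, 3)]]  # one boundary error in sentence 1
    print(expert, exact_match_f1(gold, pred))
```

Scoring a single learned annotator then amounts to decoding the test set with that annotator's embedding in place of the centroid and passing the predicted spans to `exact_match_f1`.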

D Case Study
For a more intuitive understanding of our task and the various models, we offer a representative example from the test set and analyze the model outputs. Table 6 shows the gold annotation and the model predictions. As shown, the ALL method correctly recognizes all three opinions, but fails to predict the correct boundaries. The MV method splits one opinion into two, and is able to recall one full opinion expression exactly. LSTM-Crowd is similar to ALL yet slightly better. Both the annotator-adapter and our mixup models obtain better results for this example. Note that all three opinions are difficult to recognize fully, even for crowd annotators.

Model Text and Opinions
Gold 现在驱车在这清冷寂寥的街路上，这些热闹的闪亮的灯光倒让人有心安的感觉。 Now driving on this cold and lonely street, these lively and shiny lights make me feel at ease.
ALL 现在驱车在这清冷寂寥的街路上，这些热闹的闪亮的灯光倒让人有心安的感觉。 Now driving on this cold and lonely street, these lively and shiny lights make me feel at ease.
MV 现在驱车在这清冷寂寥的街路上，这些热闹的闪亮的灯光倒让人有心安的感觉。 Now driving on this cold and lonely street, these lively and shiny lights make me feel at ease.
LSTM-Crowd 现在驱车在这清冷寂寥的街路上，这些热闹的闪亮的灯光倒让人有心安的感觉。 Now driving on this cold and lonely street, these lively and shiny lights make me feel at ease.
Our Vanilla 现在驱车在这清冷寂寥的街路上，这些热闹的闪亮的灯光倒让人有心安的感觉。 Now driving on this cold and lonely street, these lively and shiny lights make me feel at ease.
Our Final 现在驱车在这清冷寂寥的街路上，这些热闹的闪亮的灯光倒让人有心安的感觉。 Now driving on this cold and lonely street, these lively and shiny lights make me feel at ease.

Table 7: The F1 scores obtained by using different crowd annotators as input on the gold test set. Exact matching scores are reported.

LSTM-Crowd only learns an estimation of the expert, assisted by modeling the label bias of annotators, while the annotator-adapter model learns the different understandings of each annotator rather than the expert annotations. Our final mixup model is much more stable across different annotators. This observation indicates that, with the application of annotator-mixup, all annotators can learn from each other and improve towards the expert level together, which enhances the expert modeling.