Personal Bias in Prediction of Emotions Elicited by Textual Opinions

Analysis of emotions elicited by opinions, comments, or articles commonly exploits annotated corpora, in which the labels assigned to documents average the views of all annotators or represent a majority decision. Models trained on such data are effective at identifying the general views of the population. However, their usefulness for predicting the emotions evoked by textual content in a particular individual is limited. In this paper, we present a study performed on a dataset containing 7,000 opinions, each annotated by about 50 people with two dimensions, valence and arousal, and with the intensity of eight emotions from Plutchik's model. Our study showed that individual responses often significantly differed from the mean. Therefore, we proposed a novel measure to estimate this effect: Personal Emotional Bias (PEB). We also developed a new BERT-based transformer architecture to predict emotions from an individual human perspective. We found PEB to be a major factor in improving the quality of personalized reasoning. Both the method and the measure may boost the quality of content recommendation systems and personalized solutions that protect users from hate speech or unwanted content, both of which are highly subjective in nature.


Introduction
Emotions are a very important component of natural human communication. Collectively, we tend to react quite similarly emotionally to the phenomena around us, but at the level of the individual, some differences can be discerned in the intensity of the emotions experienced. Various emotional models have been used across studies. In Russell and Mehrabian (1977), emotional states are located in a multidimensional space, with valence (negative/positive), arousal (low/high) and dominance explaining most of the observed variance. Another approach distinguishes a number of basic, discrete emotions, e.g. six in Ekman and Friesen (1976) and eight in Plutchik (1982).
We can observe continuous interest in sentiment analysis and emotion recognition within the field of natural language processing (Kocoń and Maziarz, 2021; Alswaidan and Menai, 2020; Kanclerz et al., 2020). Recent approaches commonly rely on deep machine learning methods applied to large amounts of textual data (Yadav and Vishwakarma, 2020; Kocoń et al., 2019b; Kocoń et al., 2019). Nevertheless, emotion recognition remains a challenging task. One of the reasons is the lack of high-quality annotated data in which the annotators are a representative sample of the whole population. Commonly, a small number of trained annotators (usually 2 to 5) is involved. Due to differences between individual opinions, reinforced by the number of available choices (6 or 8 emotions), this often leads to low inter-annotator agreement (Hripcsak and Rothschild, 2005). Averaging the annotations collected in such a way can still be a good input for effective systems that recognize the most likely emotional responses shared by most people. It is, however, not suitable for making accurate inferences about the emotions evoked in specific individuals.
In this work, we developed a method to predict text-related emotions that most closely reflect the reactions of a given reader. In addition to the classical approach of providing only texts to the model input, we extended it with our new feature, Personal Emotional Bias (PEB). It reflects how an individual perceived the texts they evaluated in the past. In this way, we switched from averaged labels for annotated texts to individual text annotations. We tested the impact of PEB on the quality of individual recognition of emotion dimensions, also in a setup including a multilingual transformer-based architecture for the following languages: Dutch, English, Polish, French, German, Italian, Portuguese, Russian, and Spanish. Our experimental evaluation revealed that the emotional annotation of just a few texts appears to be enough to calculate an approximate value of the Personal Emotional Bias for a given user. This, in turn, enables us to significantly improve personalized reasoning. Since texts are independently annotated with ten emotional states, each with its own level, we trained and tested both multi-task classifiers and multivariate regressors. This work is inspired by our initial idea of human-centred processing presented in our earlier work. In addition, in (Kanclerz et al., 2021) we have shown that mixing user conformity measures with document controversy is effective in personalized recognition of aggressiveness in texts.

Related work
Studies have shown that the recognition of emotions should take into account the subjective assessments of individual annotators (Neviarouskaya et al., 2009; Chou and Lee, 2019; Kocoń et al., 2019a). A personal bias related to individual beliefs may have its origins in the demographic background, including factors such as first language, age, and education (Wich et al., 2020a; Al Kuwatly et al., 2020), country of origin (Salminen et al., 2018), gender (Bolukbasi et al., 2016; Binns et al., 2017; Tatman, 2017; Wojatzki et al., 2018), and race (Blodgett and O'Connor, 2017; Sap et al., 2019; Davidson et al., 2019; Xia et al., 2020). The uniqueness of a person's annotations may also derive from their political orientation, and not respecting it can significantly reduce the effectiveness of a classifier (Wich et al., 2020b).
The most common approach to mitigating the impact of personal bias on method performance is to use only annotations provided by experts (Waseem, 2016). However, selecting a small group of experts poses a risk of involving too few annotators for too many documents (Wiegand et al., 2019) or of creating unfair models that discriminate against minorities (Dixon et al., 2018). Besides, it may be difficult to find a sufficient number of experts. To resolve this, non-expert annotators can be involved: an average of non-expert annotations is enough to achieve expert-level labeling quality (Snow et al., 2008). Personal bias also affects the model evaluation process; therefore, disjoint sets of annotators should be used for the training and test sets (Geva et al., 2019).
The high variety of annotators' beliefs directly impacts the diversity of their subjective assessments. It often means that there is no single correct label for a given text. In such a case, Bayesian probabilistic models can be used to estimate the consensus level, which can then be converted to categorical values using simple methods, e.g. thresholding (Kara et al., 2015). Another solution is to regard disagreement in annotations as a positive factor that provides more information about individual humans. This ambiguity can be utilized in many ways. Patterns discovered in annotation differences can be exploited both to group like-minded individuals (Akhtar et al., 2020) and to automatically detect spammers who deliberately introduce noise into their assessments (Raykar and Yu, 2012; Soberón et al., 2013). On the other hand, an excessively high level of annotation similarity may be related to conformity bias, which reflects an excessive influence of the group's beliefs on its members (Gao et al., 2019). Moreover, annotation disagreement can indicate the ambiguity of a given text. The variability between annotators can also be used to generate soft labels such as the inter-annotator standard deviation, which may serve as an additional feature of a given sample (Eyben et al., 2012). Such soft labels can also be a good source of information about the annotators themselves, e.g. to estimate the unanimity of a specific social group in recognizing emotions (Steidl et al., 2005). Another approach is to leverage an ensemble model architecture to incorporate knowledge about the subjectivity of emotion recognition (Fayek et al., 2016). To reduce the potential noise caused by relying solely on subjective annotations, a hybrid method can be applied that mixes individual ratings with majority voting (Chou and Lee, 2019). The final model consists of multiple sub-models that use individual annotations both separately and in combination; all sub-models are fused to provide one general, non-personalized decision.
The topic of emotion personalization was explored in the context of social photos (Zhao et al., 2016) or emotions evoked by music (Yang et al., 2007). However, in the context of text analysis, it has not been studied sufficiently yet.

Dataset and annotation procedure
To create the Sentimenti dataset, a combination of different methodologies was used, namely Computer Assisted Personal Interview (CAPI) and Computer Assisted Web Interview (CAWI) (Kocoń et al., 2019a). Two studies were carried out, involving the evaluation of 30,000 word meanings (CAWI1) and 7,000 reviews from the Internet (CAWI2). The reviews cover three domains: medicine (3,130 texts), hotels (2,938 texts), and other (936 texts). In this work, we focus on CAWI2, since entire documents were evaluated within that study.
In the CAWI2 study, each text received an average of 50 annotations. To obtain reliable results, 8,853 unique respondents were sampled as a cross-section of the Polish population. Sex, age, native language, place of residence, education level, marital status, employment status, political beliefs, and income were controlled, among other factors.
The annotation schema was based on the procedures most widely used in NAWL, NAWL BE, and plWordNet-emo (Zaśko-Zielińska et al., 2015; Janz et al., 2017; Kocoń et al., 2018; Kulisiewicz et al., 2015). The acquired data consists of ten emotional categories: valence, arousal, and eight basic emotions: sadness, anticipation, joy, fear, surprise, disgust, trust, and anger. The distributions of mean text ratings within the emotional categories are presented in Figure 1. In total, 7,000 opinions × 53.46 annotators per opinion on average × 10 categories ≈ 3.74M single annotations were collected.
The annotation process was carried out using a web-based system with an interface designed in collaboration with a team of psychologists to minimize the difficulty of the annotation procedure and its impact on the ratings and their quality (see Figure 2). The collection resulting from the study is copyrighted, and we obtained permission to conduct the research. A sample containing 100 texts with annotations and annotators' metadata, together with the source code of the experiments, is publicly available on GitHub.

Personal Emotional Bias (PEB) and agreement measures
In principle, we assume our collection (Internet review documents) is split into three partitions: past ($D_{past}$), present, and future (Figure 3). The past texts are used to estimate individual user beliefs and biases. The present documents allow us to train the reasoning model, whereas the future reviews are used for evaluation and testing.
To quantify the individual subjective emotional perception of textual content, we introduce a new measure, Personal Emotional Bias $PEB(u, c)$. It describes to what extent the previously known annotations $v_{c,d,u}$ of a given user $u$ differ from the average annotations provided by all others for emotional category $c$, aggregated over all documents $d \in D_{past}$. Here $c \in C$, where $C = \{$sadness, anticipation, joy, fear, surprise, disgust, trust, anger, valence, arousal$\}$. The integer values of the emotional annotations $v_{c,d,u}$ follow from the study design (Figure 2). First, we compute the mean emotional value $\mu_{c,d}$ of each document $d \in D_{past}$ in each category $c$ over all previously known annotations of $d$, i.e. those provided by the set $U_d$ of users from the train data:

$$\mu_{c,d} = \frac{1}{|U_d|} \sum_{u \in U_d} v_{c,d,u}$$

In the next step, we calculate the standard deviation $\sigma_{c,d}$ of each emotional category $c$ for each document $d$ in a similar way:

$$\sigma_{c,d} = \sqrt{\frac{1}{|U_d|} \sum_{u \in U_d} \left(v_{c,d,u} - \mu_{c,d}\right)^2}$$

Figure 2: Emotional annotations for a real example of a hotel review from the CAWI study. Participants scored eight basic emotions (Plutchik's model), arousal, and valence on separate scales, varying from 0 to 4 for emotions and arousal, and from -3 to 3 for valence. The example review was manually translated from Polish to English: "This is our favorite place in the Giant Mountains, so we're biased. The cuisine is excellent (fantastic trout or Hungarian cake), delicious honey beer from our own brewery and the palace is getting prettier and prettier. This time we used only the restaurant, but next time we will also stay in the hotel again. We will come back here many times."
Based on the above, we can estimate the Personal Emotional Bias $PEB(u, c)$ of user $u$ for emotional category $c$. It is an aggregated Z-score:

$$PEB(u, c) = \frac{1}{|D_{past,u}|} \sum_{d \in D_{past,u}} \frac{v_{c,d,u} - \mu_{c,d}}{\sigma_{c,d}}$$

where $D_{past,u}$ is the set of documents $d \in D_{past}$ annotated by user $u$.
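To make the computation concrete, below is a minimal sketch in Python/pandas, assuming the annotations are stored in a long-format DataFrame with hypothetical columns `user`, `doc`, `category`, and `value`; it illustrates the measure and is not the original implementation:

```python
import pandas as pd

def peb(annotations: pd.DataFrame, train_users: set, user, category: str) -> float:
    """Personal Emotional Bias PEB(u, c): aggregated Z-score of a user's past
    annotations against mean/std statistics computed from train-set users only."""
    past = annotations[annotations["category"] == category]
    # mu_{c,d} and sigma_{c,d} come exclusively from previously known (train) users
    stats = (past[past["user"].isin(train_users)]
             .groupby("doc")["value"].agg(["mean", "std"]))
    # the user's own annotations of past documents, joined with those statistics
    own = past[past["user"] == user].join(stats, on="doc")
    own = own[own["std"] > 0]  # skip documents with missing or zero variance
    z_scores = (own["value"] - own["mean"]) / own["std"]
    return z_scores.mean()
```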
Please note that $PEB(u, c)$ may be calculated for any user who provided annotations for any document $d \in D_{past}$. This means that we can estimate PEB for users from the dev and test sets, always aggregated over past documents. Nevertheless, the components $\mu_{c,d}$ and $\sigma_{c,d}$ are fixed and computed only from previously available knowledge, i.e. from the users in the train set. Obviously, the train, dev, and test sets differ in each of the ten cross-validation folds, which forces a recalculation of all PEB values in each fold.
The PEB measure provides information about the unique views and preferences of an individual user. We expect PEB to be more informative in the case of ambiguous texts with relatively low agreement among the annotators. To measure this agreement, we leveraged two different document controversy measures: (1) the averaged Krippendorff's alpha coefficient $\alpha^{int}$ (Krippendorff, 2013) and (2) the general $contr_{std}$ controversy measure. The former is commonly used and is resistant to missing annotations (Al Kuwatly et al., 2020; Wich et al., 2020a; Binns et al., 2017). Given our data, we used the variant of Krippendorff's alpha coefficient $\alpha^{int}$ with the interval difference function

$$\delta_{interval}(v_{c,d,u}, v_{c,d,u'}) = \left(v_{c,d,u} - v_{c,d,u'}\right)^2$$

which measures the distance between the two annotations $v_{c,d,u}$ and $v_{c,d,u'}$ of document $d$ provided by two different users $u$ and $u'$ for emotional category $c$. Our first emotional controversy measure is the Krippendorff's alpha coefficient $\alpha^{int}_c$ calculated separately for each emotional category $c \in C$.
The second, alternative measure $contr_{std}(d)$ was also used to analyze the controversial nature of a document $d$. It is the standard deviation of user ratings averaged over all emotional categories $c \in C$:

$$contr_{std}(d) = \frac{\sum_{c \in C} \sigma_{c,d}}{|C|}$$
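Both controversy measures can be sketched as follows; the interval-level alpha here relies on the open-source `krippendorff` package, which is an assumption on our part, as the paper does not name a specific implementation:

```python
import numpy as np
import krippendorff  # pip install krippendorff

def alpha_interval(ratings: np.ndarray) -> float:
    """Krippendorff's alpha for one emotional category.
    `ratings` is a users x documents matrix with np.nan marking missing annotations."""
    return krippendorff.alpha(reliability_data=ratings,
                              level_of_measurement="interval")

def contr_std(ratings_by_category: dict[str, np.ndarray]) -> np.ndarray:
    """contr_std(d): per-document std of user ratings, averaged over all categories.
    Each dict value is a users x documents matrix (np.nan = missing annotation)."""
    per_category_std = [np.nanstd(m, axis=0) for m in ratings_by_category.values()]
    return np.mean(per_category_std, axis=0)  # one controversy value per document
```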

Experimental plan and scenarios
All experiments were performed for two types of machine learning tasks (Figure 4):
• Multi-task classification, where each task was to predict a discrete answer for each emotional category, i.e. one of five classes {0, 1, 2, 3, 4} for the eight emotions and arousal, and one of seven classes for valence. Due to data imbalance ('0' was the dominant class for most emotions), the F1-macro measure was used to estimate model performance;
• Multivariate regression, where the task was to estimate the numerical value of each emotional category. This approach takes into account the distances between user ratings. The R-squared measure was applied to assess model quality.

Figure 3: The CAWI2 collection was divided by texts (columns) and users/annotators (rows). The past texts (15% of all) were used to compute the PEB measure. The models were trained on 55% of the present texts and 80% of all users. They are verified on the dev set (disjoint from train) and tested on the test set, both containing 10% of users and 15% of texts each. These proportions were chosen so that there were at least 1,000 texts and more than 500 annotators in each section. The user-based split into train, dev, and test is performed in the 10-fold cross-validation schema.
In order to investigate the effect of PEB on emotion recognition for individual annotators, the following input-data scenarios were considered:
• AVG: the mean annotation value (regression) or the most common class (classification) over all texts, compared to the target values; this scenario is treated as the initial baseline;
• TXT: text embeddings; this was the main baseline;
• TXT+DEM: text embeddings and the annotator's demographic data;
• TXT+PEB: text embeddings and the annotator's PEB;
• ALL: text embeddings, demographic data, and PEB.

Figure 4: Two approaches to reasoning: (1) 10-task classification and (2) multivariate regression. In (1), the output contains 10 out of 52 classes. In (2), the output contains 10 real values, one for each emotional category. V stands for valence, A for emotional arousal.

An additional SIZE scenario was performed to examine the impact of the number of annotated texts used in PEB on the emotion recognition quality. As a source of text embeddings, the following models for Polish were used: (1) HerBERT, (2) XLM-RoBERTa, (3) fastText, and (4) RoBERTa. The first one, HerBERT, is currently considered state of the art according to the KLEJ benchmark (Rybak et al., 2020). Two neural network architectures were used to perform the experiments: (1) a multi-layer perceptron (MLP) for transformer-based text embeddings, and (2) an LSTM for fastText-based word embeddings (with 32 hidden units and a dropout of 0.5) followed by an MLP that combines the LSTM output with the additional features. In both cases, the size of the input depends on the embedding size. The MLP output for classification is a multi-hot vector of length 52 (8 emotions × 5 possible ratings, plus 7 possible valence ratings and 5 possible arousal ratings), and for regression a vector of size 10 containing real values ranging from 0 to 1, one for each emotional dimension.
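The combination of text embeddings and PEB can be sketched as a small PyTorch module; the hidden size and activation are assumptions, since the paper specifies the architecture only at this level of detail:

```python
import torch
import torch.nn as nn

class EmotionMLP(nn.Module):
    """MLP head combining a text embedding with the 10-dimensional PEB vector
    (the TXT+PEB scenario); set peb_dim=0 for the TXT-only baseline."""

    def __init__(self, embedding_dim: int = 768, peb_dim: int = 10,
                 hidden: int = 256, regression: bool = True):
        super().__init__()
        # 52 classification outputs = 8 emotions x 5 ratings + 7 valence + 5 arousal
        out_dim = 10 if regression else 52
        self.regression = regression
        self.net = nn.Sequential(
            nn.Linear(embedding_dim + peb_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, text_emb: torch.Tensor, peb: torch.Tensor) -> torch.Tensor:
        x = torch.cat([text_emb, peb], dim=-1)
        logits = self.net(x)
        # regression targets are real values in [0, 1] for each emotional dimension
        return torch.sigmoid(logits) if self.regression else logits
```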
Ten-fold cross-validation was applied as a randomized non-overlapping partition of users with a single division of texts (Figure 3). Such an approach is in line with leave-one-subject-out (LOSO) cross-validation, where data is also split according to participants (subjects), i.e. the data of one or more users is separated into the test set. This is commonly treated as the SOTA approach in emotion recognition (Barlett et al., 1993; Schmidt et al., 2019). In the SIZE scenario, we verified what incremental gain in the model evaluation score we would achieve by increasing the number of texts used in PEB (Figure 5 and Figure 6). The PEB measure denotes how much the emotional perception of a given user differs from the opinions of other users. To examine the significance of PEB for different emotional dimensions, we calculated the correlation between the PEB model results (R-squared) and the Krippendorff's alpha coefficient $\alpha^{int}_c$ for each emotional category $c \in C$.
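As a rough sketch of the user-based partitioning described above, scikit-learn's GroupKFold keeps annotators disjoint between train and test folds; the `user_ids` array (one entry per single annotation) is hypothetical, and the paper additionally splits by texts as in Figure 3:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
user_ids = rng.integers(0, 500, size=10_000)  # hypothetical annotator id per annotation
X = np.zeros((len(user_ids), 1))              # placeholder feature matrix

for train_idx, test_idx in GroupKFold(n_splits=10).split(X, groups=user_ids):
    train_users = set(user_ids[train_idx])    # mu, sigma, and PEB use these users only
    assert train_users.isdisjoint(user_ids[test_idx])  # no annotator leaks across folds
```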
To investigate the impact of PEB for multiple languages, we automatically translated the Polish texts into eight languages using DeepL. According to our manual evaluation of translation quality, DeepL offers better context matching of target-language utterances than other solutions available on the market. We applied the original annotations to the translated texts and then trained dedicated models using XLM-RoBERTa. The training, validation, and test sets were identical for all languages. The results are given in Table 5 for classification and Table 6 for regression.
In order to verify the significance of differences between the evaluation results of each model in each scenario, we performed the independent-samples t-test with the Bonferroni correction, as we compared more than two models. We also checked the normality assumptions beforehand using the Shapiro-Wilk test. If a sample did not meet them, we used the non-parametric Mann-Whitney U test.
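This testing procedure can be sketched with SciPy as follows; `scores_a` and `scores_b` stand for the per-fold scores of two compared models, and the 0.05 thresholds are assumptions:

```python
from scipy import stats

def significantly_different(scores_a, scores_b,
                            n_comparisons: int, alpha: float = 0.05) -> bool:
    """Compare two models' per-fold scores with a Bonferroni-corrected threshold."""
    corrected_alpha = alpha / n_comparisons  # Bonferroni correction
    # check the normality assumption for both samples first
    normal = (stats.shapiro(scores_a).pvalue > 0.05 and
              stats.shapiro(scores_b).pvalue > 0.05)
    if normal:
        p = stats.ttest_ind(scores_a, scores_b).pvalue     # independent-samples t-test
    else:
        p = stats.mannwhitneyu(scores_a, scores_b).pvalue  # non-parametric fallback
    return p < corrected_alpha
```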

Results
The results for all experimental scenarios and models, averaged over ten folds, are presented in Table 1 for classification and Table 2 for regression. The performance in each emotional category for all experimental variants of the best model (HerBERT) is specified in Table 3 for classification and Table 4 for regression; the multilingual results are given in Table 5 for classification and Table 6 for regression. Figure 5 presents the R-squared results for the TXT+PEB scenario and the HerBERT model in relation to the number of texts from the past set used to estimate the personal bias $PEB(u, c)$, averaged over all emotional categories and all users $u$. The past texts $d$ annotated by user $u$ are either selected randomly or starting from the most controversial, i.e. those with the greatest $contr_{std}(d)$ value among all texts annotated by $u$ in the past. The per-emotion results for random selection only are shown in Figure 6. Figure 7 depicts the correlation between the annotation consistency measured with Krippendorff's alpha and the prediction performance in the regression task for the best model, HerBERT.

Discussion
The best results for each model were observed in the TXT+PEB scenario. The use of demographic data as an additional user characteristic apart from the PEB measure in the ALL scenario did not provide significantly better results. The HerBERT model achieved the best results, but the differences between the models are not statistically significant (except for the Polish RoBERTa).
The performance improvement related to demographic data about individual users was considered in the TXT+DEM scenario. Demographic features encode biases of social groups. However, once we have individual biases (the PEB measure), demographic data becomes redundant and negatively affects the results: compare TXT+PEB vs. ALL.
The PEB measure quantifies the difference between the opinions of a particular user and those of the others. In addition to beliefs, user decisions are also influenced by the UI design. Some emotional categories may have been incomprehensible to individual users, so their annotations do not reflect their actual opinions. Moreover, the value scale could have been misunderstood by some annotators, who might mark the middle value when unsure whether a given emotional category was present in the analyzed text at all.
The simple statistical methods based on the averaged opinion about the text, represented by the AVG scenario, perform much worse than the language models combined with an MLP. Predicting the user's opinion solely from the text in the TXT scenario (our baseline) also results in poor performance. Therefore, there is a need to exploit personalized user data. The gain from personalization is the same for each of the four considered models. This means that proper personalization carried out at the stage of input-data preparation is much more important than the choice of the language model or the inference model.

Figure 5: R-squared results for the TXT+PEB scenario and the HerBERT model in relation to the number of texts from the past set used to compute the $PEB(u, c)$ values for a given user $u$, averaged over all emotional categories and all users. Two text selection procedures were considered: random and the most controversial ($contr_{std}(d)$). The baseline is the TXT scenario. The results per emotion category with random selection are in Figure 6.
In the case of the regression models, the complementary nature of the PEB measure and the text itself is clearly visible; see the PEB and TXT scenarios in Table 2, Table 4, and Table 6. It manifests in a large number of cases in which higher quality of inference from the text (TXT scenario) corresponds to lower quality of PEB-based inference (PEB scenario) and vice versa. In turn, their combination provides very good results. We calculated the correlation between the evaluation results over the emotional categories: it equals -0.558 and -0.970 for the results in Table 3 and Table 4, respectively. We also analyzed the correlation between two values: (1) the sum of the results in the TXT and PEB scenarios and (2) the result in the TXT+PEB scenario. For the regression models, the correlations are 0.999, 0.995, and 0.896 for the results in Table 2, Table 4, and Table 6, respectively. In a similar way, we computed the correlations for the classification models; they reach 0.802, 0.931, and 0.257 for the data from Table 1, Table 3, and Table 5, respectively.
The performance in the PEB scenario is the lowest for the valence category, which may result from the highest agreement level ($\alpha^{int}_c = 0.38$) and a flatter distribution (Figure 1). Simultaneously, the reasoning based on text only (TXT scenario) demonstrated the opposite dependency: its performance is greatest for the highest agreement (valence) and lowest for low agreement (surprise, arousal, and anticipation). This means that the more users disagree, the more we should rely on personal biases rather than solely on the textual content.

Figure 6: R-squared results for the TXT+PEB scenario and the HerBERT model in relation to the number of texts from the past set, randomly selected to compute $PEB(u, c)$, averaged over all users $u$ (solid lines). The dotted lines of the same color are the per-category baselines (the TXT scenario).
Even a single user-annotated document used to estimate PEB can boost the reasoning (Figure 5). Moreover, only about 5-7 texts provided in the past are enough to capture the personal user beliefs; beyond that, the gains are much smaller. This holds for all emotional categories (Figure 6). The benefit is greater if PEB is computed from the 1-3 most controversial texts ($contr_{std}$) annotated by a given user.
We have discovered a nearly linear negative correlation between the annotators' agreement level (Krippendorff's alpha coefficient) and the performance of the regression model based only on the personal bias (PEB); see Figure 7.

Conclusions
Summarizing the experiments performed, we can draw several conclusions related to the additional data that can be gathered during the annotation process. Using these data, we are able to significantly improve reasoning about emotional categories, i.e. the prediction of emotions evoked by a given textual opinion in different people.
The most important conclusion is that the use of our proposed Personal Emotional Bias measure allows for a tremendous gain in prediction scores for a particular annotator. Thus, we have shown that, using current state-of-the-art methods for text embedding and data from just a few annotations made by an individual user, we can infer the user's perception of emotions with much greater effectiveness. This opens up the possibility of creating dedicated, personalized solutions targeted at specific social groups and individuals we want to reach with a given message.
We have shown that the demographic data of annotators has a positive impact on predicting their reactions, although not as strong as the answers they provided during the survey itself. In addition, the combination of text content, demographic data, and the single PEB feature built from a user's historical ratings is even several times better than the quality of responses given by a system based on text data alone.
Such a great influence of single-individual data on the outcome reveals a completely new direction. NLP solutions should focus more on good design of the annotation process, its flow, and individual text-annotation pairs rather than on post-processing and generalization of data, i.e. common class labels obtained by majority voting. The best proof of this thesis is the fact that we are able to successfully ignore the problem of annotator disagreement on a given text and fill these gaps with per-user information.
In future work, we want to investigate the effect of individual PEB vector components on recognition quality. Additionally, we want to extend the PEB with information about the averaged annotation value of texts. Finally, the quality of dedicated models for individual emotional dimensions can be compared to the multi-task model presented in this work.