ArtELingo: A Million Emotion Annotations of WikiArt with Emphasis on Diversity over Language and Culture



Abstract
This paper introduces ArtELingo, a new benchmark and dataset designed to encourage work on diversity across languages and cultures. Following ArtEmis, a collection of 80K artworks from WikiArt with 0.45M emotion labels and English-only captions, ArtELingo adds another 0.79M annotations in Arabic and Chinese, plus 4.8K in Spanish to evaluate "cultural-transfer" performance. More than 51K artworks have 5 annotations or more in 3 languages. This diversity makes it possible to study similarities and differences across languages and cultures. Further, we investigate captioning tasks and find that diversity improves the performance of baseline models. ArtELingo is publicly available at www.artelingo.org with standard splits and baseline models. We hope our work will help ease future research on multilinguality and culturally-aware AI.

Introduction
Figure 1 compares and contrasts annotations on WikiArt across languages and cultures. We believe these differences are interesting and important, and far from random. One might suggest using machine translation to translate English captions into many other languages, but we believe that doing so would miss much of the opportunity. Building human-compatible AI that is more aware of our emotional being is important for increasing the social acceptance of AI. ArtEmis (Achlioptas et al., 2021) is an important step in this direction, introducing a collection of 0.45M emotion labels and affective language explanations in English on more than 80,000 artworks from WikiArt. However, by design, ArtEmis is limited to English, lacking coverage of other cultures and languages.
Cultural differences are a major source of diversity (Meyer, 2014). The customs, social values, lifestyles, and history of different countries and cultures greatly influence human behavior. Emotional experiences are no exception; people from different countries respond differently to similar scenarios. For example, a person born and raised in a Nordic country would be more comfortable in a lush forest than in a desert, but a Bedouin may be more comfortable in a desert than in a forest.
Consider Figure 1c, where an Arabic annotator assigned the image the label contentment, while the other two annotators used the label sadness. Captions are useful for diving deeper into these differences. The sadness annotations mention death and disasters (the Chinese caption reads, "After the winter snow, everything is blanketed in white, and the withered trees look desolate"; the Spanish caption mentions a natural disaster: "I don't like the atmosphere; the first thing that came to mind was a natural disaster leaving destruction in its path"), in contrast with the contentment annotation, which ends with: feeling of satisfaction.
There can be interesting differences between languages/cultures even when annotators use the same label. Consider Figure 1b, where all three labels are contentment. Although the three captions agree on the label, two of the captions imply that some or all of the girls are sisters, while there is no such implication in the English caption.
We believe deep nets will be viewed as more culturally aware if they can capture linguistic/cultural patterns such as these. Emotions are based on past experience and play an integral role in determining human behavior. Not only do they reflect our internal state, but they also directly affect how we perceive and interpret external stimuli (Izard, 2009), and how we act on them (Lerner et al., 2015). Hence, studying emotions is essential to exploring a confounding aspect of human intelligence.
In summary, our contributions are: 1. 0.79M annotations (labels + captions) in Arabic and Chinese, plus 4.8k in Spanish, 2. a benchmark with standard splits, and 3. baseline models for two tasks: (1) label prediction and (2) affective caption generation.
The rest of the paper is organized as follows: related work is discussed in §2, followed by our main motivation in §3, and data collection in §4. §5 provides qualitative and quantitative analyses of ArtELingo. Baseline models for emotion label prediction and caption generation are presented in §6 and §7, respectively.
Figure 2 panels: (a) ArtEmis: I love everything about this painting of a mother and her two children lovingly interacting with the family pet cat. (b) COCO: A man and a woman holding a little kid while sitting at a table outside.

Captions with Emotions
Work on captioning is moving beyond the factual captions of early benchmarks such as COCO (Lin et al., 2014). Figure 2 shows two images of families, one from ArtEmis and the other from COCO. Both captions capture the facts, but ArtEmis enhances the facts with emotion/commentary.
Table 1 compares three benchmarks: COCO (Lin et al., 2014), ArtEmis, and ArtELingo. ArtEmis encourages work on emotions by replacing COCO photos with WikiArt, and by introducing 9 emotion classes: 4 positive (amusement, awe, contentment, excitement), 4 negative (anger, disgust, fear, sadness), and Other. ArtELingo encourages researchers to work on visually grounded multilinguality by providing affective annotations in three languages (henceforth, ACE/ACES): Arabic, Chinese, and English. In addition, we provide a small set of Spanish (S). Figure 3 shows that positive emotions are more frequent than negative emotions, especially in Arabic.

Related Work in Other Fields
There is a considerable literature on emotions, especially in Psychology (Russell and Barrett, 1999).
Bias is the flip side of inclusiveness. There has been considerable discussion recently about biases (Bender et al., 2021; Bolukbasi et al., 2016; Buolamwini and Gebru, 2018; Mehrabi et al., 2021; Liu et al., 2021). Some of this work is more relevant to our interest in Chinese (Jiao and Luo, 2021; Liang et al., 2020) and Arabic (Abid et al., 2021). Many machine learning methods will, at best, learn what is in the training data. There have been some attempts to remove biases in corpora, but it might also be constructive to create more inclusive benchmarks such as ArtELingo.
Awareness of different cultures is becoming increasingly important. Gone are the days when it was sufficient for datasets to focus on a single culture. Recently, the Vision & Language community has been producing more multicultural, multilingual datasets (Bugliarello et al., 2022; Srinivasan et al., 2021; Armitage et al., 2020). ArtELingo contributes cultural diversity over emotional experiences. The effect of culture on psychology has been studied in separate studies (Henrich et al., 2010; Abu-Lughod, 1990; Norenzayan and Heine, 2005). ArtELingo provides empirical evidence that might motivate cultural psychology studies.

Opportunities for Improvement
Many of the resources mentioned above have advanced our understanding of the relationship between emotion and various stimuli, though there are always opportunities for improvement. We are particularly interested in three such opportunities: scale, multimodality, and multilinguality/multiculturalism. As for scale, demand for larger training sets is expected to continue to increase, given the rise of large-scale foundation models (Bommasani et al., 2021).
As for multimodality, although most benchmarks mentioned above focus on a single modality, there are a few multimodal exceptions such as IEMOCAP (Busso et al., 2008), COCO, and ArtEmis. IEMOCAP collected speech as well as facial and hand movements of 10 actors. Unfortunately, this approach may be expensive to scale up.

The use of the Amazon Mechanical Turk platform makes ArtEmis easier to scale; however, ArtEmis is limited to English. ArtELingo addresses multilinguality/multiculturalism by adding Arabic and Chinese annotations. We use languages as a proxy for different cultures: English as a representative sample of the West, Chinese of the East, and Arabic of the Middle East.

ArtELingo
Following ArtEmis, we employ the Amazon Mechanical Turk (AMT) platform to collect our data, using the interfaces shown in Figures 8, 9, and 10 in the appendix.
We faced a shortage of Arabic- and Chinese-speaking annotators on AMT, which led us to devise different strategies to recruit annotators. Arabic speakers were recruited by advertising the task at Middle Eastern universities, encouraging students and their families to join our data collection effort. Chinese speakers were recruited through Baidu, whom we would like to thank.
Annotators are asked to carefully examine each artwork before selecting the dominant emotion it induces from a list of four positive emotions, four negative emotions, and Other (to indicate a different emotion). Annotators are then asked to write a caption that reflects the content of the artwork and explains their choice of emotion. Similar to ArtEmis, we collect annotations from five annotators for each artwork.
For better cultural representation in ArtELingo, we restrict the collection of annotations in each language to countries with large numbers of native speakers. Chinese data is collected from China. For Arabic, we collect our data mainly from Saudi Arabia and Egypt. Finally, Spanish is collected from Latin America and Spain. Figure 4 shows that most of the annotations come from a long tail of workers who each annotated fewer than 1,000 artworks, ensuring a diverse representation of cultures.

Quality Control. Annotations were rejected if they were too short, or if they were too similar to captions for other artworks. In addition, a manual review was conducted by multiple reviewers, ensuring that captions reflect the selected emotion label and the details of the artwork. Table 3 reports statistics on annotations that passed this review process.
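To make these automatic checks concrete, here is a minimal sketch of a caption filter, assuming a word-count threshold and a string-similarity cutoff; the specific thresholds and the use of Python's difflib are our illustrative assumptions, not the project's actual reviewing pipeline.

```python
from difflib import SequenceMatcher

MIN_WORDS = 5          # assumed threshold; the paper does not state one
MAX_SIMILARITY = 0.9   # assumed near-duplicate cutoff

def is_too_short(caption: str) -> bool:
    """Flag captions with too few words."""
    return len(caption.split()) < MIN_WORDS

def is_near_duplicate(caption: str, other_captions: list[str]) -> bool:
    """Flag captions that are too similar to captions for other artworks."""
    return any(
        SequenceMatcher(None, caption, other).ratio() > MAX_SIMILARITY
        for other in other_captions
    )

def passes_automatic_checks(caption: str, other_captions: list[str]) -> bool:
    # Annotations passing these checks still go through manual review.
    return not is_too_short(caption) and not is_near_duplicate(caption, other_captions)
```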

Dataset Analysis

Qualitative
There are some interesting similarities and differences across languages and cultures, as discussed in Figure 1. There is considerable inter-annotator agreement (IAA) in the dataset, and there are also some interesting disagreements. There is agreement in Figure 2a that a mother's love is universally warm and pleasant; it is an instinct for mothers to be loving, caring, and protective of their children. (That said, the English caption for Figure 2a highlights the cat, whereas the Arabic and Chinese captions focus on the family and do not mention the cat; the Chinese caption reads, "The woman is looking at her child, which makes people feel very happy.") On the other hand, there is a difference in Figure 1a.
All three annotators observe a waterfall, though some mention energy and growth, while others see horses and wedding veils. Agreement is computed as a log-likelihood agreement score, $A = \log_2\left(\Pr(G \mid D) / \Pr(G \mid U)\right)$, where $G$ is one of the 10 genres, and $U$ and $D$ are two sets of artworks. $\Pr(G \mid U)$ is the fraction of artworks in $U$ with genre $G$, and $\Pr(G \mid D)$ is the fraction of artworks in $D$ with genre $G$.
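The agreement score can be computed directly from genre counts. A minimal sketch, assuming each artwork is represented by its genre label (function and variable names are ours):

```python
import math
from collections import Counter

def agreement_score(genre: str, d_genres: list[str], u_genres: list[str]) -> float:
    """A = log2( Pr(G|D) / Pr(G|U) ) for one genre G.

    d_genres and u_genres hold one genre label per artwork for the
    sets D and U, respectively."""
    pr_g_given_d = Counter(d_genres)[genre] / len(d_genres)
    pr_g_given_u = Counter(u_genres)[genre] / len(u_genres)
    # Undefined when the genre is missing from either set.
    return math.log2(pr_g_given_d / pr_g_given_u)
```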

Quantitative
Let $U$ be the universal set of artworks; that is, $U$ contains all artworks in ArtELingo with 5 annotations or more in each of the 3 languages. Table 4 shows that there is more agreement for some genres (landscapes) and more disagreement for others (sketches). When the agreement score is near 0, the genre is about equally likely in $U$ and $D$; this is to be expected for genres near the middle of the list, such as misc. Figure 5 shows 8 artworks in genres with high agreement and high disagreement. Figure 6 reports Cohen's Kappa scores for annotations from language pairs. Annotators working in the same language show higher agreement.
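A sketch of how the cross-language agreement behind Figure 6 might be computed. We assume one representative emotion label per artwork per language (e.g., the majority label), aligned in a shared artwork order, and we use scikit-learn's implementation; both choices are our assumptions, not necessarily the authors' tooling.

```python
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

def cross_language_kappa(labels_by_lang: dict[str, list[str]]) -> dict[tuple[str, str], float]:
    """Cohen's Kappa for every pair of languages.

    labels_by_lang maps a language code to one emotion label per
    artwork, with all lists following the same artwork order."""
    return {
        (a, b): cohen_kappa_score(labels_by_lang[a], labels_by_lang[b])
        for a, b in combinations(sorted(labels_by_lang), 2)
    }

# Hypothetical usage:
# kappas = cross_language_kappa({"ar": ar_labels, "en": en_labels, "zh": zh_labels})
```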
We created $D$ for the zero-shot experiments reported in §6. The 4.8k Spanish annotations in Table 3 are on the set $D$ of artworks with low inter-annotator agreement (IAA) in ACE (Arabic, Chinese, and English).

Emotion Label Prediction
Baseline models for two tasks, emotion label prediction and caption generation, will be discussed in this section and the next. These discussions assume familiarity with deep nets, including fine-tuning BERT (Devlin et al., 2019) and cross-lingual models such as XLM (Conneau et al., 2020), as well as HuggingFace (Wolf et al., 2019).

Emotion Classification. Given an input caption $c$, we wish to predict an output emotion label $\hat{e}$, where $\hat{e}$ is one of the 9 emotions. The model starts with a pretrained language model, $LM$, and a tokenizer. The tokenizer converts $c$ into a sequence of $L$ tokens, $x$. The language model converts $x$ into a more useful representation, $LM(x) \in \mathbb{R}^{L \times d}$, where $d$ is the number of hidden dimensions (a property of the $LM$). Finally, we feed $LM(x)$ into a linear layer to predict the emotion label $\hat{e}$.

Majority Baseline. We use the majority emotion label for each artwork as the predicted emotion for all captions belonging to that artwork. Concretely, each artwork $I$ has a set of caption-emotion pairs $S$; the majority classifier outputs the most frequent emotion $\hat{e}$ in $S$ for all of the captions $c \in S$.

Language Models. We fine-tune 3 models based on BERT (BERT-E, BERT-A, and BERT-C), where BERT-E is tuned for English, BERT-A for Arabic, and BERT-C for Chinese. Section 11.2 discusses pretraining and fine-tuning details. We also fine-tune 4 models based on the cross-lingual model XLM-RoBERTa (Conneau et al., 2020), where XLM-E, XLM-A, and XLM-C correspond to English, Arabic, and Chinese, as before. In addition, we create XLM-ACE by training on the combination of all 3 languages.

3-Headed Transformer. Finally, we create a model with an XLM-R backbone but replace the single classifier head with 3 classifier heads, one for each of the 3 languages. During training, we feed the captions from each language to the shared backbone and then use the corresponding head to predict an emotion that reflects the culture of that language. Geva et al. (2021) analyzed similar multi-headed transformers and showed how the non-target heads can be used to interpret the results of the target head. Similarly, our 3-headed transformer can be used to predict 3 different emotions, each reflecting the cultural norms represented in one language. We can then use these predictions to better understand the similarities and differences between cultures.

Experimental Setup. We use the base versions of both the BERT and XLM-R models with their default tokenizers from HuggingFace. We use the standard fine-tuning procedure: the Adam optimizer for 5 epochs on batches of size 32 with a learning rate of $2 \times 10^{-5}$. We use cross entropy as the loss function and update the full model parameters, including the transformer backbone. We follow the standard ArtEmis (Achlioptas et al., 2021) splits introduced in (Mohamed et al., 2022) and adopt them for both the Arabic and Chinese datasets. The same training and testing images are used in all cases. For BERT models, we only evaluate on the same language as the training set, because BERT tokenizers are language specific.

Baseline Results. Table 5 reports accuracy for several BERT/XLM models. There are 4 test sets, one for each language, plus ACE (a combination of the 3 languages). XLM models perform better than BERT, because there is no data like more data, and because of the cross-lingual setup used during pretraining. Interestingly, scores on the Chinese test set are higher than for English and Arabic, suggesting that Chinese captions are easier to classify. Finally, notice that XLM-ACE (XLM trained on 3 languages) outperforms the other conditions, showcasing the benefits of multiple languages. Note that XLM-ACE even outperforms matching conditions, where training language = test language.

3-Headed Transformer Analysis. Although the 3-headed transformer did not improve accuracy, the 3 classification heads are useful for error analysis. We feed the entire ArtELingo dataset to the model and predict 3 $\hat{e}$ values, one for each head/language. Confusion matrices are reported in Figure 7. There is more agreement on negative emotions, and less agreement on positive emotions.
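To make the setup above concrete, here is a minimal sketch of the 3-headed classifier: an XLM-R backbone shared across languages with one linear head per language over the 9 emotion classes, trained with the reported hyperparameters. This is our reconstruction from the description, not the authors' code; details such as first-token pooling are assumptions.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

LANGS = ("arabic", "chinese", "english")
NUM_EMOTIONS = 9  # 4 positive, 4 negative, and Other

class ThreeHeadedClassifier(nn.Module):
    def __init__(self, backbone_name: str = "xlm-roberta-base"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        d = self.backbone.config.hidden_size  # d in LM(x) in R^{L x d}
        # One classification head per language, sharing the backbone.
        self.heads = nn.ModuleDict({lang: nn.Linear(d, NUM_EMOTIONS) for lang in LANGS})

    def forward(self, input_ids, attention_mask, lang: str):
        h = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        pooled = h.last_hidden_state[:, 0]  # assumed: pool the first token
        return self.heads[lang](pooled)     # logits over the 9 emotions

# Training setup as reported: Adam, 5 epochs, batch size 32, lr 2e-5,
# cross-entropy loss, updating all parameters including the backbone.
model = ThreeHeadedClassifier()
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
loss_fn = nn.CrossEntropyLoss()
```

The single-headed XLM baselines correspond to the same sketch with one shared linear head in place of the per-language ModuleDict.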
We are interested in the large off-diagonal values in Figure 7, especially between positive and negative emotions. For example, Arabic disgust is often confused with English amusement.
Upon further investigation, we found that nude paintings contributed ∼15% of these confusions. Explicit content and alcohol are frowned upon in some Arabic-speaking communities, as illustrated by the second and third rows of Table 6, where the label is positive in English and Chinese, but not in Arabic.
Religious symbols are also associated with large off-diagonal values in the confusion matrices. The first row in Table 6 mentions Jesus and how a beautiful girl holds his cross and stomps on the devil. The annotation is positive (awe) in English and Arabic, but negative (fear) in Chinese. In China, the cross holds less meaning, and stomping on the devil is more scary than reassuring. Many symbols are associated with religion, holidays, and legends that mean more in some places than others (dragons, for example, are positive in the East but negative in the West). While there are a few off-diagonal cells with large values, most of the large values in the confusion matrices are on the main diagonal. That is, the similarities across languages tend to dominate the differences. Consider the last row in Table 6,

| Caption (language and gloss) | E head | A head | C head |
|---|---|---|---|
| [A] A beautiful girl holds the cross of Jesus and stomps on the devil (gloss) | Awe | Awe | Fear |
| [E] The woman on the ground isn't wearing any clothes | Amu. | Dis. | Amu. |
| [E] The man looks like he's drunk since his expression is so wired out | Amu. | Sad. | Exc. |
| [C] Countless babies have descended into the world, giving life to the world and making people feel happy. | Cont. | Cont. | Cont. |

Table 6: Predictions from the 3-Headed Transformer. The input is a caption in Arabic (A), Chinese (C), or English (E). The first column shows the language and a gloss. The last three columns show predictions for each head (with interesting differences across heads).
which receives a positive label (contentment) in all 3 languages. Babies make people feel happy (nearly) everywhere. In this case, all 3 heads of our 3-headed transformer predict positive labels for this caption. For training models across multiple languages, similarities across languages may be more useful than differences.
Zero-Shot Evaluation. We use the Spanish annotations in ArtELingo to evaluate the models above in a zero-shot setting. The last column in Table 5 reveals two interesting relations. The first suggests that 3-Heads may not perform as well as XLM when there is plenty of data, but 3-Heads may have advantages in low-resource and zero-shot settings; 3-Heads are better at capturing interactions between languages.
The second relation suggests that language transfer may be more effective across some language pairs than others. Historically, Spanish and English are relatively close Indo-European languages, compared to Semitic languages such as Arabic. There has been much less contact (Thomason, 2001) between those languages and Chinese.
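Zero-shot evaluation here amounts to scoring the fine-tuned classifiers on the Spanish annotations without any Spanish training. A minimal sketch, reusing the ThreeHeadedClassifier sketch from the previous code block and taking the mode of the three heads' votes ("M" in Table 5); the helper below is hypothetical:

```python
import torch
from collections import Counter

@torch.no_grad()
def zero_shot_accuracy(model, tokenizer, captions, gold_label_ids):
    """Score the 3-headed classifier on a language it was never trained on.

    Each language head votes, and the mode of the 3 votes is the prediction;
    ties are broken arbitrarily in this sketch."""
    model.eval()
    correct = 0
    for caption, gold in zip(captions, gold_label_ids):
        enc = tokenizer(caption, return_tensors="pt", truncation=True)
        votes = [
            model(enc["input_ids"], enc["attention_mask"], lang=lang).argmax(-1).item()
            for lang in ("arabic", "chinese", "english")
        ]
        correct += int(Counter(votes).most_common(1)[0][0] == gold)
    return correct / len(captions)
```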

Affective Caption Generation
The previous section described baseline models for the first task: label prediction. This section describes baseline models for the second task: affective caption generation.
To this end, we follow Achlioptas et al. (2021) and train two affective captioning models: Show, Attend, and Tell (SAT) (Xu et al., 2015) and the Meshed-Memory Transformer (M²) (Cornia et al., 2020). We use the term affective captioning models for captioning models that generate affective captions; these captions connect the dots between input paintings and emotions.
SAT is an LSTM (Hochreiter and Schmidhuber, 1997) based captioning model with an attention module; it consists of a visual encoder and a text decoder. The visual encoder extracts visual features from an input image. The decoder then uses a stack of an attention module and an LSTM recurrent unit to generate a caption autoregressively. M² is a transformer-based model (Vaswani et al., 2017) that uses a pretrained Faster R-CNN (Ren et al., 2015) object detector to extract visual region features. These features form the input sequence to a multi-layer attention-based encoder. M² differs from basic transformers by feeding the encoded features from all encoder layers to the cross-attention module in each decoder layer. To include emotion and language grounding, we use a simple embedding layer to convert the emotion and language labels into feature vectors, which are then concatenated with the visual features.
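The emotion and language grounding described above (an embedding layer per label type, with the resulting vectors concatenated with the visual features) might look like the following sketch. This is our reconstruction; the fusion point (appending to the visual feature sequence) and the shared dimensionality are assumptions.

```python
import torch
import torch.nn as nn

class GroundedVisualFeatures(nn.Module):
    """Append emotion and language embeddings to the visual feature sequence."""

    def __init__(self, visual_dim: int, num_emotions: int = 9, num_langs: int = 3):
        super().__init__()
        self.emotion_emb = nn.Embedding(num_emotions, visual_dim)
        self.lang_emb = nn.Embedding(num_langs, visual_dim)

    def forward(self, visual_feats, emotion_id, lang_id):
        # visual_feats: (batch, regions, visual_dim) from the visual encoder
        e = self.emotion_emb(emotion_id).unsqueeze(1)  # (batch, 1, visual_dim)
        l = self.lang_emb(lang_id).unsqueeze(1)        # (batch, 1, visual_dim)
        # Concatenate along the sequence axis so the decoder can attend to
        # the emotion and language tokens alongside the image regions.
        return torch.cat([visual_feats, e, l], dim=1)
```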
Experimental Setup. For both models, we use the default parameters proposed in (Achlioptas et al., 2021). We train four versions of each model: three are trained on the English-only, Arabic-only, and Chinese-only datasets, while the fourth is trained on the three languages combined. We then test all the models on all the languages.

Results. We report the results of our baseline models in Table 7. Models trained on all the languages perform very similarly to their language-specific counterparts on every metric, except for Chinese. This provides additional evidence that English- and Arabic-speaking cultures are more closely related to one another than either is to Chinese culture. In other words, English captioning models do not lose much performance when Arabic data is added to the training set, and vice versa; on the other hand, Chinese models suffer when such data is added. Moreover, we also observe that for models trained on single languages, the scores on the combined test set are proportional to those on the language-specific test sets.

Conclusion
This paper introduced ArtELingo, a multilingual dataset and benchmark on WikiArt images with more than 1.2M captions and emotion labels. The benchmark captures diverse emotional experiences across different cultures, communicated in four languages (English, Chinese, Arabic, and Spanish). We found more agreement for some genres, such as landscapes, and more disagreement for others, such as sketches. These differences are interesting and important, and far from random. Annotations for the trees in Figure 1c are labeled sadness in English and Chinese but contentment in Arabic. People are likely to feel more comfortable with what they know: people raised in countries with lush forests are likely to prefer forests, whereas people brought up in less humid environments are likely to prefer deserts. Towards building more socially and multiculturally aware AI, we created baseline models for two tasks on ArtELingo: (1) emotion label prediction and (2) affective caption generation. For emotion label prediction, our best baseline model trained XLM on a combination of training data from all three languages (XLM-ACE). We also created a 3-headed transformer, training three heads for the three languages (Arabic, Chinese, and English) at the same time. The performance of this model is close to XLM-ACE, but it generalizes better in a zero-shot experiment on Spanish. For the caption generation task, we trained two models, SAT and M². For English and Arabic, models trained on all three languages perform similarly to language-specific models, but for Chinese it is best to train without the other languages, since the performance drop is otherwise significant.
We hope our benchmark and baselines will help ease future research on visually grounded language models that can communicate affectively with us. In addition, ArtELingo can provide empirical examples of cross-cultural similarities and differences. Sociologists and cultural psychologists may formulate hypotheses and conduct field studies based on ArtELingo. Data, code, and models are publicly available at www.artelingo.org/.

Limitations
ArtELingo's artworks are extracted from WikiArt. Although ArtELingo is diverse in language and culture, it inherits WikiArt's bias toward Western artworks, as discussed in Table 2 in §3.1. There is room to improve the representation of certain regions of the world. Due to globalisation, people around the world tend to follow similar trends (for better and for worse).
Many cultures, such as Arabic culture, do not have a rich heritage of oil paintings; instead, they have other art forms, such as poetry and calligraphy. Such art forms are interesting to study in their own right, but how to mix them with paintings is not obvious. We chose WikiArt, the basis of the original ArtEmis dataset, with the intent of continuing that line of work. Also, artworks are more accessible and can be interpreted more easily across cultures than poetry and other art forms.
The addition of affective captions in Arabic and Chinese, as well as a small set in Spanish, is a step toward cultural diversity. However, more than four regions and languages are indeed needed to cover the world, and scalability can be a challenge. We hope that progress can be accelerated by developing affective vision and language models that can learn from limited data for each additional language by distilling knowledge from language-only models, as in (Chen et al., 2022; Alayrac et al., 2022).
ArtELingo was also collected through AMT's online platform. This suggests that the workers are familiar with technology and social media, which influences the data: social media spreads trending news and standards, which may lead to similarities between cultures. There have been, of course, other concerns about the use of AMT, the so-called "gig" economy, and workers' rights. Each task takes on average 50 seconds to complete. In addition, we paid bonuses (mostly 30%) to workers who submitted high-quality work.
The workers were given full-text instructions on how to complete tasks, including examples of approved and rejected annotations (please refer to §11.4). Participants' consent was obtained ahead of participation. Due to privacy concerns from the IRB, comprehensive demographic information could not be obtained.

Figure 1: ArtELingo, a multilingual dataset and benchmark of WikiArt with captions & emotions

Figure 2: COCO captures the facts, and ArtEmis enhances those facts with emotion/commentary.

Figure 6: Cohen's Kappa for inter-annotator and cross-annotator agreement. Higher values mean more agreement.

Figure 7: Confusion matrices. The heatmaps compare predictions from the 3-Headed Transformer with ground truth.

Figure 8: Arabic Interface

Table 2: WikiArt is more representative of the West.

Representation of Regions in WikiArt

ArtELingo assumes that WikiArt is a representative sample of the cultures of interest. While WikiArt is remarkably comprehensive, Table 2 suggests the WikiArt collection has better coverage of the West than of other regions of the world. This table is based on WikiArt's assignment of artworks to nationalities. We assigned each nationality to West (English and non-English), Middle East (Arabic and non-Arabic), East (Chinese), and Other.

Table 3: Size of the annotation effort by language.

Table 4 reports multicultural agreement over the 9 emotion classes (amusement, awe, contentment, excitement, anger, disgust, fear, sadness, and Other) in each genre. WikiArt classifies artworks into 10 genres (portrait, landscape, genre painting (misc), religious painting, abstract painting, cityscape, sketch and study, still life, nude painting, and illustration), as well as 27 styles.

Table 5: Emotion Label Classification Baselines. The majority baseline outputs the most frequent emotion for each artwork. Models are fine-tuned on BERT and XLM backbones. Accuracy is best for XLM-ACE. "ACE" combines Arabic (A), Chinese (C), and English (E). "M" stands for mode, where the majority vote among the 3 heads is used. For Spanish, we evaluate the models without any fine-tuning (zero-shot prediction).

Table 7: Affective Captioning Baseline (SAT and M²).