GD-COMET: A Geo-Diverse Commonsense Inference Model

With the increasing integration of AI into everyday life, it is becoming crucial to design AI systems that serve users from diverse backgrounds by making these systems culturally aware. In this paper, we present GD-COMET, a geo-diverse version of the COMET commonsense inference model. GD-COMET goes beyond Western commonsense knowledge and is capable of generating inferences pertaining to a broad range of cultures. We demonstrate the effectiveness of GD-COMET through a comprehensive human evaluation across 5 diverse cultures, as well as extrinsic evaluation on a geo-diverse task. The evaluation shows that GD-COMET captures and generates culturally nuanced commonsense knowledge, demonstrating its potential to benefit NLP applications across the board and contribute to making NLP more inclusive.


Introduction
Culture plays a significant role in shaping an individual's worldviews, beliefs, behaviours, and communication styles (Spradley, 1987). A considerable portion of what is commonly referred to as commonsense knowledge is not universal but rather culture-specific, including social norms, values, traditions, and more. An example of cultural differences is greetings, which may involve a handshake in Western cultures, bowing in some Asian cultures, a 'namaste' gesture in India, or 'wai' in Thailand.
With AI systems becoming increasingly ubiquitous in society, it is imperative to go beyond the Western cultural perspective (Hershcovich et al., 2022). Lack of cultural awareness may lead to models perpetuating stereotypes and reinforcing societal inequalities (Hutchinson et al., 2020; Ross et al., 2021; Søgaard, 2022), impeding their effectiveness for users from non-Western countries.
In this paper, we focus on a popular model for commonsense reasoning, COMET (Bosselut et al., 2019), which is based on an English language model (LM) and further trained on commonsense inferences collected from North American crowdsource workers (Sap et al., 2019). Consequently, the model exhibits a certain bias towards the North American cultural perspective. As evidenced by Fig. 1, COMET displays limited familiarity with the concept of a German pancake, erroneously interpreting the term "dutch baby" in a literal sense.
We identify a need for more inclusive commonsense reasoning models and propose GD-COMET: Geo-Diverse COMET. As demonstrated in Fig. 1, GD-COMET has acquired the culturally relevant knowledge to interpret "dutch baby" as a legitimate dish.
GD-COMET is similarly based on an English LM but is trained on a knowledge base of cultural knowledge (Nguyen et al., 2023) prior to training on COMET's original training data. This simple approach is effective, as judged by both human evaluation and extrinsic evaluation on a geo-diverse task (Yin et al., 2021). GD-COMET can potentially benefit many downstream NLP applications where the user population is diverse.

Background

Commonsense Inference Models
Many NLP tasks require reasoning beyond what is explicitly stated in the text. People fill in those gaps with their commonsense knowledge. NLP models attempt to do the same by leveraging commonsense knowledge bases (KBs) such as ConceptNet (Speer et al., 2017) and ATOMIC (Sap et al., 2019). To achieve better coverage, knowledge models such as COMET (Bosselut et al., 2019) are based on pre-trained LMs and further fine-tuned on KBs, enabling contextually-relevant inferences along the KB's dimensions for new contexts.
COMET and its successors assume the universality of commonsense knowledge, yet much of this knowledge may differ among cultures, in traditions (e.g., duration of a wedding ceremony; Acharya et al., 2021), foods (e.g., what counts as breakfast food; Speer et al., 2017), social norms, and more.

Culture-Aware NLP
While multilingual NLP is a popular topic, culture-aware NLP is under-explored. It is crucial for language technologies to not only serve speakers of a wide variety of languages but also acknowledge that users come from diverse cultures (Hershcovich et al., 2022). Cultural norms and pragmatic aspects differ across speakers from different cultures (Zhou et al., 2023). Nevertheless, English LMs primarily reflect a North American lens due to training on web data with a US user bias (Cao et al., 2023).
Current work in culture-aware NLP addresses various aspects. One line of work focuses on cultural stereotypes and biases, and ways to measure and mitigate them (e.g., Hutchinson et al., 2020; Ross et al., 2021; Søgaard, 2022). Another line of work analyzes the differences in culture-specific commonsense knowledge, including relational knowledge (Yin et al., 2022), grounding of time expressions (Shwartz, 2022), food-related customs (Palta and Rudinger, 2023), and social values (Lin et al., 2021; Arora et al., 2023). At the same time, there have been efforts to develop benchmarks (Yin et al., 2021; Liu et al., 2021) and adapt models to new cultures (Zhou et al., 2023; Yin et al., 2023). Finally, there are several recent cultural KBs such as StereoKG (Deshpande et al., 2022), Quasimodo (Romero et al., 2019), and CANDLE (Nguyen et al., 2023). CANDLE, which we use in this work, is the most comprehensive among them, containing 1.1M assertions in English about 386 cultures (e.g. "A Dutch baby is a German pancake that is baked instead of cooked on the stove top"). CANDLE assertions were extracted from a large web corpus and clustered into 5 facets of culture: food, drinks, clothing, rituals, and traditions.

GD-COMET
We present GD-COMET, a geo-diverse version of COMET. The goal of GD-COMET is to generate high-quality commonsense inferences for concepts and events pertaining to both Western and non-Western cultures. Rather than collecting a large-scale geo-diverse dataset in the style of ATOMIC, we split the training into two phases: (1) training the underlying LM on geo-diverse data; (2) continuing training on the large-scale original COMET training data. This is motivated by Bosselut et al. (2019), who showed that implicit commonsense knowledge from the underlying LM's pretraining transfers to COMET. We similarly hypothesize that encoding geo-diverse data into the underlying LM prior to training on COMET data will transfer this knowledge to GD-COMET.

Geo-Diverse Training (GD-BART).
We pick 770,000 assertions from CANDLE with a combined score greater than 0.5. This threshold selects highly distinctive assertions that are specific and relevant to their respective regions. We fine-tune BART-Large, the underlying LM of the latest COMET model (Hwang et al., 2021), on this data, using BART's original pre-training objectives (token masking, token deletion, text infilling, and sentence permutation). We train for 50 epochs on two NVIDIA A40 GPUs and save the model checkpoint with the lowest validation loss.
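The data selection and denoising setup for this phase can be sketched as follows. This is a minimal illustration rather than the actual training code: the `combined_score` field name is an assumption about CANDLE's schema, and `text_infill` is a simplified stand-in for one of BART's corruption objectives (replacing a contiguous span with a single mask token).

```python
import random

def filter_assertions(assertions, threshold=0.5):
    """Keep only assertions whose combined score exceeds the threshold.
    The 'combined_score' field name is illustrative, not CANDLE's actual schema."""
    return [a for a in assertions if a["combined_score"] > threshold]

def text_infill(sentence, mask_token="<mask>", span_len=2, seed=0):
    """Simplified BART-style text infilling: replace one contiguous span
    of span_len tokens with a single mask token."""
    rng = random.Random(seed)
    tokens = sentence.split()
    if len(tokens) <= span_len:
        return mask_token
    start = rng.randrange(len(tokens) - span_len)
    return " ".join(tokens[:start] + [mask_token] + tokens[start + span_len:])
```

In the real setup, corrupted assertions are fed to BART-Large, which is trained to reconstruct the original text.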
COMET Training. We then fine-tune GD-BART on the large-scale ATOMIC-2020 dataset, using the same training method and hyperparameters as Hwang et al. (2021). Appendix A lists the 34 COMET relations used in this paper.

Intrinsic Evaluation
To evaluate the quality of GD-COMET, we construct a set of input sentences pertaining to 5 diverse cultures (Table 1). We sample 5 concepts for each facet and use facet-specific templates (Appendix B) to create 20 sentences for each culture. For each of COMET and GD-COMET, we use beam search to generate 5 inferences for each of the 34 dimensions and convert them to natural language statements using relation-specific templates based on prior work (Bosselut et al., 2019). The correctness of the inferences from both models was judged by 10 graduate students, two from each of the respective cultures. Annotators were asked to grade inferences along the following criteria on a scale of 0 (worst) to 3 (best):
1 Cultural Relevance: The inference is factually accurate and reflects the values, customs, traditions, and societal norms associated with the given culture.
2 Stereotype Avoidance: The inference does not perpetuate stereotypes about the culture.
3 Linguistic Accuracy: The inference is grammatical, and the vocabulary and idiomatic expressions are appropriate in that culture.
The annotations yielded substantial inter-annotator agreement, with κ = 0.656 for COMET and κ = 0.702 for GD-COMET, measured as the average Cohen's Kappa (Cohen, 1960) across cultures.
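The agreement statistic above can be computed per annotator pair and then averaged across cultures. A minimal sketch of Cohen's Kappa for two annotators over nominal labels:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement)."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items the two annotators label identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: sum over labels of the product of marginal frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b.get(c, 0) for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)
```

Averaging these per-culture values yields the reported overall agreement.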
Results. Table 1 reveals that GD-COMET consistently outperforms the standard COMET model. Specifically, GD-COMET excels in generating culturally aligned inferences across the chosen diverse cultures, and is more likely than COMET to avoid biased assumptions. However, there is still room for improvement for South Korea and Nigeria.

Extrinsic Evaluation
Traditional benchmarks often fall short in testing models' knowledge and comprehension of diverse cultural contexts. To show GD-COMET's utility for downstream tasks, we evaluate on a multimodal task, GD-VCR (Sec 5.1). We develop a model inspired by VLC-BERT (Ravi et al., 2023a) that generates inferences and incorporates them into a vision and language (V&L) model (Sec 5.2). We show that GD-COMET improves performance on GD-VCR over an array of baselines (Sec 5.3) and demonstrate the inferences contributing to the performance gains (Sec 5.4).

Dataset
Visual Commonsense Reasoning (VCR; Zellers et al., 2019) is a benchmark for testing V&L models' ability to understand and reason beyond a visual scene. Each example consists of an image extracted from movies or TV series and a multiple-choice question about the actions or people depicted in the image. This dataset focuses solely on Western, primarily North American movies.
The Geo-Diverse Visual Commonsense Reasoning dataset (GD-VCR; Yin et al., 2021) follows the same setup as VCR but extends to diverse regions. This evaluation-only dataset includes 328 images from movies and TV series in East Asian, South Asian, African, and Western countries (see Appendix C). We follow the original setup and train our model on VCR before testing on GD-VCR.

Model (VLC-BERT with GD-COMET)
We take inspiration from VLC-BERT (Ravi et al., 2023a), which incorporated COMET inferences into VL-BERT (Su et al., 2019). Instead, we integrate GD-COMET as a source of contextualized cultural commonsense knowledge for GD-VCR. Figure 2 illustrates the model. We describe VLC-BERT below and note where our model deviates from it.

Knowledge Generation and Selection. VLC-BERT uses the question and the object tags as input to COMET. Instead of object tags, we generate an image caption using BLIP (Li et al., 2023) and extract noun phrases from the caption using SpaCy (Honnibal et al., 2020). We found that the noun phrases provide a more detailed description of the depicted activities within the image (e.g. "family, burn" in Fig. 2). We additionally append a country tag to the input. During training on VCR, we use the tag "North America", the primary source of movies in the dataset. For the images in GD-VCR, we extracted country tags from Wikipedia. We use beam search to generate five inferences for each of the 34 dimensions. To select the most relevant inferences, we convert them to natural language sentences using relation-specific templates and select those most similar to the question using SBERT embeddings (Reimers and Gurevych, 2019).
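The selection step can be sketched as follows. In the actual model the embeddings come from SBERT; in this illustration they are placeholder vectors, and the function simply ranks candidate inferences by cosine similarity to the question embedding and keeps the top k.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def select_inferences(question_vec, candidates, k=5):
    """Rank (text, embedding) candidates by similarity to the question
    embedding and return the k most similar inference texts."""
    ranked = sorted(candidates, key=lambda c: cosine(question_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]
```

With real SBERT vectors, this keeps the inferences most semantically related to the question out of the 5 × 34 generated candidates.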
Overall Architecture. The generic input to VL-BERT for VCR is <question, answer tokens, image regions>. Following Ravi et al. (2023a), we embed each inference with SBERT and summarize them into a single token with a weighted average based on learned attention scores. Finally, we feed the output of the [CLS] token into a classifier to predict a score for each answer choice. We train the model using binary cross-entropy loss for 20 epochs on 4 NVIDIA RTX6000 GPUs.
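The summarization of inference embeddings into a single input token can be sketched as a softmax-weighted average. In the model the attention scores are learned jointly with the rest of the network; here they are supplied as plain numbers for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def summarize_inferences(embeddings, scores):
    """Collapse N inference embeddings into one vector via a weighted
    average using softmax-normalized attention scores."""
    weights = softmax(scores)
    dim = len(embeddings[0])
    return [sum(w * e[i] for w, e in zip(weights, embeddings)) for i in range(dim)]
```

The resulting vector is inserted as a single extra token into VL-BERT's input sequence, keeping the sequence length fixed regardless of how many inferences were selected.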

Results
Table 2 compares our model's performance on GD-VCR with baselines that: (i) do not make use of commonsense knowledge (VL-BERT); (ii) generate inferences using GD-BART; and (iii) use COMET (VLC-BERT w/ COMET). Note that the same signals (i.e., country tag and noun phrases) were used for the GD-BART and COMET baselines. We also include prior results reported using VisualBERT and ViLBERT for completeness.
VLC-BERT w/ COMET modestly improves upon VL-BERT across most regions, with an overall improvement of 1.2 points in accuracy. This suggests that COMET provides some commonsense inferences that are universal. In contrast, GD-COMET shows a substantial improvement of nearly 5 points over VL-BERT and 4 points over VLC-BERT w/ COMET. This highlights the effectiveness of incorporating GD-COMET for downstream tasks that require culture-specific knowledge across diverse regions. Furthermore, GD-BART performs less effectively than other methods, underscoring the importance of training on structured knowledge to generate contextually relevant responses.

Qualitative Analysis
Figure 3 presents several GD-VCR instances along with the models' predictions and the inferences generated by COMET and GD-COMET for them. In Figure 3a, GD-COMET accurately associates a girl wearing henna in Somalia with marriage. In Figure 3b, it understands that folding palms during an Indian festival signifies a greeting or welcome. Finally, in Figure 3c, it recognizes that bowing in South Korea is a gesture of apology, making VLC-BERT w/ GD-COMET the only model that provides a correct answer. In contrast, COMET's inferences for this example are generic and irrelevant. These examples highlight GD-COMET's effectiveness in identifying the cultural context and dynamically generating culturally-relevant commonsense inferences across ATOMIC's relations.

Conclusion
This work challenges the current notion of universally applicable commonsense knowledge by introducing GD-COMET, a geo-diverse variant of COMET. GD-COMET can generate culturally-nuanced commonsense inferences for a broad range of cultures. Our comprehensive evaluation confirms the effectiveness of GD-COMET in incorporating and leveraging cultural cues. We view our work as a step towards developing more inclusive and culturally-aware AI systems.

Limitations
While GD-COMET represents a significant advancement in incorporating cultural commonsense knowledge into AI models, a few limitations need to be acknowledged.
First, the availability of comprehensive, high-quality data remains a challenge in training culturally-aware models. While resources like CANDLE provide a step forward in curating diverse cultural knowledge, it is essential to note that merely capturing the existence of concepts within a culture is insufficient. Future efforts should aim to collect data that not only reflects the presence of certain concepts but also encompasses how people perceive and interpret those concepts within their specific cultural contexts. This would require extensive data collection efforts that go beyond surface-level understanding and delve into the nuances of cultural perspectives.
A second limitation is the availability of suitable benchmarks for testing models' knowledge and understanding of cultural variations. In particular, two such tasks, GD-VCR and MarVL (Liu et al., 2021), focus on vision and language, while Nguyen et al. (2023) propose a cultural knowledge quiz. We hope to see more language-only datasets developed to go beyond testing models on knowledge about concepts from diverse cultures to understanding cultural nuances.

Ethics Statement
Despite being designed to be more culturally inclusive, GD-COMET runs the risk of unintentionally perpetuating biases present in CANDLE data. In particular, CANDLE might misrepresent cultures with stereotypes or underrepresent cultures. Addressing these concerns requires proactive measures such as identifying biases using methods such as Mehrabi et al. (2021) and mitigating them through filtering and additional data collection.
Additionally, the limited size of evaluation benchmarks means they do not always account for cultural variations within the same region. For example, GD-VCR images in the African region are concentrated in East Africa. Addressing this issue would similarly require additional annotation efforts.

Figure 1: Inferences from COMET and GD-COMET for the sentence "PersonX eats a dutch baby", demonstrating lack of culture awareness in COMET.

Figure 2: A model using GD-COMET for GD-VCR.

Figure 3: Attention analysis of commonsense inferences generated by COMET and GD-COMET for testing samples in GD-VCR.

Table 1: Evaluation of COMET and GD-COMET inferences, judged by annotators from the respective cultures.

Table 2: Performance of the different models on the subset of each region in GD-VCR. We report the average across 3 runs (see Appendix D for the results of individual seeds). Results marked with * have been reported in Yin et al. (2021).