Modeling Cross-Cultural Pragmatic Inference with Codenames Duet

Pragmatic reference enables efficient interpersonal communication. Prior work uses simple reference games to test models of pragmatic reasoning, often with unidentified speakers and listeners. In practice, however, speakers' sociocultural background shapes their pragmatic assumptions. For example, readers of this paper assume NLP refers to "Natural Language Processing," and not "Neuro-linguistic Programming." This work introduces the Cultural Codes dataset, which operationalizes sociocultural pragmatic inference in a simple word reference game. Cultural Codes is based on the multi-turn collaborative two-player game, Codenames Duet. Our dataset consists of 794 games with 7,703 turns, distributed across 153 unique players. Alongside gameplay, we collect information about players' personalities, values, and demographics. Utilizing theories of communication and pragmatics, we predict each player's actions via joint modeling of their sociocultural priors and the game context. Our experiments show that accounting for background characteristics significantly improves model performance for tasks related to both clue giving and guessing, indicating that sociocultural priors play a vital role in gameplay decisions.


Introduction
"Most of our misunderstandings of other people are not due to any inability to... understand their words... [but that] we so often fail to understand a speaker's intention."
- George Armitage Miller (1974)

Certain pragmatic inferences can only be interpreted by individuals with shared backgrounds.
⋆ Equal contribution.

Figure 1: Steps 1-5 outline high-level gameplay tasks. THE CLUE GIVER targets the words fall and drop, giving the hint slip. THE GUESSER misinterprets slip as a piece of paper, guessing receipt and check.
For example, what researchers call fun may not be fun for kindergartners. Theories from sociolinguistics, pragmatics, and communication aim to explain how sociocultural background affects interpersonal interaction (Schramm, 1954), especially since variation occurs across several dimensions: class (Bernstein, 2003; Thomas, 1983), age (Labov, 2011), gender (Eckert and McConnell-Ginet, 2013), race (Green, 2002), and more. Rigorously modeling how culture affects pragmatic inference on all axes is understandably challenging. The board game Codenames Duet offers a more restricted setting of turn-based word reference between two players. In each round, THE CLUE GIVER provides a single-word clue; then THE GUESSER must interpret this clue to select the intended word references on the game board. Ideal inferences come from the players' common ground, the set of shared beliefs between them (Clark, 1996). In practice, however, a player's behavior can be idiosyncratic. Each player has knowledge and experience that shape how they interpret clues and make guesses. When players' backgrounds differ, they may be more likely to misinterpret their partner, as seen in Figure 1.
Inspired by the above, we model the role of sociocultural factors in pragmatic inference with a new task and a series of ablation experiments. First, we describe the CULTURAL CODES dataset of cross-cultural Codenames Duet gameplay, with relevant background information from the players' demographics, personalities, and political and moral values (§3). Then, we deconstruct each action in a game into a distinct modeling task, taking inspiration from work on cross-cultural pragmatics (§4). Finally, we model each task with and without sociocultural priors, and highlight how player background improves model performance (§6). Our dataset and code are released publicly at https://github.com/SALT-NLP/codenames

Related Work

Cross-Cultural Pragmatics and NLP Pragmatics describes the nonliteral meaning that comes from context and social inference (Purpura, 2004; Thomas, 1983; Hatch et al., 1992). Although some pragmatic categories are universal (e.g., politeness), they can be expressed differently across sociocultural contexts (Taguchi, 2012; Shoshana et al., 1989; Gudykunst and Kim, 1984). When an intended meaning is misinterpreted, this is known as "pragmatic failure" (Thomas, 1983), often the result of misaligned reference frames or differences in common ground (Stadler, 2012; Crawford et al., 2017). Especially relevant to Codenames are communal lexicons, where common ground manifests in shared community vocabulary (Clark, 1998). Another axis of difference is between low- and high-context cultures (Hofstede, 2001); high-context cultures rely more on shared background. Pragmatics also differs by age (Saryazdi et al., 2022), region, ethnicity, politics, and class (Thomas, 1983), as does theory of mind (Fiske and Cox, 1979; Miller, 1984; Shweder, 1984; Lillard, 1998, 1999). Outside of work on politeness (Sperlich et al., 2016; Fu et al., 2020), sarcasm (Joshi et al., 2016), and irony (Karoui et al., 2017), NLP has not closely considered cross-cultural pragmatics.
While there is work on understanding the role of individual culture, for example, learning demographic word vectors (Garimella et al., 2017), identifying deception/depression (Soldner et al., 2019; Loveys et al., 2018), or improving translation (Specia et al., 2016), modeling cross-cultural pragmatic inference in communication remains a challenge (Hershcovich et al., 2022).
Still, a culture-free pragmatics has played a central role in various NLP tasks, including instruction following (Fried et al., 2018), image captioning (Andreas and Klein, 2016), persona-consistent dialogue (Kim et al., 2020), and summarization (Shen et al., 2019). Much of this work is grounded in Bayesian models of cognition (Griffiths et al., 2008), with models like Bayesian Teaching (Eaves Jr et al., 2016), Naive Utility Calculus (Jara-Ettinger et al., 2016; Jern et al., 2017), and the Rational Speech Acts (RSA) model (Goodman and Frank, 2016; Franke and Jäger, 2016) that integrate language, world knowledge, and context to explain ideal pragmatic reasoning (Noveck, 2018) and grounded reference (Monroe et al., 2017). Instead of modeling socioculture in isolation, we model pragmatic inference, highlighting the role of culture in general interpersonal interaction.

The CULTURAL CODES Dataset
This study has been approved by the Institutional Review Board (IRB) at the authors' institution. The purpose of the CULTURAL CODES dataset is to understand how measurable social factors influence dyadic communication in English. By collecting relevant participant background information, we aim to understand how these factors affect linguistic reasoning in a collaborative reference game.

Codenames Duet Game Overview
Codenames Duet is a collaborative variant of Codenames (Vlaada, 2015) designed for 2 players. The players share a 5 × 5 board of 25 common words. Each player has a distinct (but sometimes partially overlapping) map from words on the board to the following objectives: goal, neutral, and avoid. Each player's map is hidden from their partner. The objective of the game is for both players to guess all of their partner's goal words without guessing any of their partner's avoid words, as doing so results in an immediate loss.
CULTURAL CODES uses an adapted version of Codenames Duet. With each turn, players alternate between THE CLUE GIVER and THE GUESSER roles. To begin the turn, THE CLUE GIVER (1) selects one or more associated goal words as targets. Next, THE CLUE GIVER (2) provides a single-word clue that relates to the associated target(s). This clue is displayed to THE GUESSER, along with the number of targets she should find. THE CLUE GIVER also (3) provides a justifying rationale for the clue, describing the relationship between the clue and the target(s). This rationale is not displayed to the partner. Using the clue and the number of target words, THE GUESSER (4) guesses the targeted words. For each guess, THE GUESSER (5) provides a justifying rationale. After ending the turn, players alternate roles and continue until all goal words are selected for both sides, or players are eliminated for guessing an avoid word. An overview of roles is illustrated in Figure 1. In §4, we formalize actions (1)-(4) as distinct modeling tasks.
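The five actions above can be grouped into a single turn record; a minimal sketch (field names are ours for illustration, not a schema from the dataset):

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    """One turn of adapted Codenames Duet, capturing actions (1)-(5)."""
    targets: list[str]                  # (1) goal words THE CLUE GIVER selects
    clue: str                           # (2) single-word clue shown to THE GUESSER
    target_rationales: dict[str, str]   # (3) hidden: target -> free-text rationale
    guesses: list[str] = field(default_factory=list)                # (4) guesses
    guess_rationales: dict[str, str] = field(default_factory=dict)  # (5) guess -> rationale

# The Figure 1 example: "slip" targets fall/drop, but is misread as paper.
turn = Turn(
    targets=["fall", "drop"],
    clue="slip",
    target_rationales={"fall": "slip and fall", "drop": "slip can mean drop"},
)
turn.guesses = ["receipt", "check"]   # misinterpretation: slip as a piece of paper
```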

Selecting Board Game Words
All experiments are run on a strategically filtered subset of the 400 words from Codenames Duet. We select the 100 most abstract and semantically ambiguous board game words to elicit diverse responses from players. Since the polysemy (Ravin and Leacock, 2000) of a word, i.e., the number of related senses it includes, predicts the expected diversity of player responses, we retain only nouns with two or more senses in WordNet (Miller, 1992). Next, we rank the polysemous words with Brysbaert et al. (2014)'s concreteness list, selecting the 100 most abstract words according to the mean of their human concreteness scores (the finalized list can be found in Appendix A). When a player starts a game, we initialize the board with a random subset of 25 words from the filtered 100. For each player, 9 words are randomly mapped to goal, 3 to avoid, and 13 to neutral.
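The filtering and board-initialization procedure can be sketched as follows. This is illustrative only: the toy `sense_counts` and `concreteness` dictionaries stand in for WordNet sense lookups and Brysbaert et al.'s concreteness norms, and `make_board` mirrors the 9/3/13 assignment described above.

```python
import random

# Toy stand-ins for WordNet sense counts and mean human concreteness scores.
sense_counts = {"spring": 6, "bond": 5, "grace": 4, "table": 2, "oxygen": 1}
concreteness = {"spring": 3.0, "bond": 2.4, "grace": 1.6, "table": 4.9, "oxygen": 3.8}

# Step 1: keep words with >= 2 senses. Step 2: rank by abstractness (low concreteness).
polysemous = [w for w in sense_counts if sense_counts[w] >= 2]
abstract_first = sorted(polysemous, key=lambda w: concreteness[w])
# On the real 400-word list, the 100 most abstract words would be kept here.

def make_board(pool, rng):
    """Sample a 25-word board and one player's hidden map: 9 goal, 3 avoid, 13 neutral."""
    board = rng.sample(pool, 25)
    shuffled = rng.sample(board, 25)   # random permutation of the board
    return board, {"goal": shuffled[:9], "avoid": shuffled[9:12], "neutral": shuffled[12:]}

pool = [f"word{i}" for i in range(100)]          # placeholder for the filtered 100
board, hidden_map = make_board(pool, random.Random(0))
```

Each player would receive an independently sampled map over the same board, which is why the maps only partially overlap.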

Gameplay Data
To collect gameplay data, we modified an open-source implementation of Codenames Duet,¹ automatically pairing individuals who visited the game website. To source players, we relied on Amazon's Mechanical Turk. We provided MTurkers with an initial instruction video detailing the rules and how to play. To be eligible for the task, Turkers had to answer ≥ 80% of questions correctly on a qualifying quiz about Codenames rules and gameplay (Appendix D.1). Average game length was around 17.4 minutes, and MTurkers were paid $2.50 for every game.
Gameplay Attributes For each completed turn, we collected the following game state information from THE CLUE GIVER. Elements marked in gray were hidden from THE GUESSER.
Clue: THE CLUE GIVER's clue c (e.g., c could be "transport" for the target "car").
Target Word(s): (Hidden) The target words t_n (e.g., "car") that THE CLUE GIVER intended THE GUESSER to guess.
Target Word(s) Rationale(s): (Hidden) A free-text phrase r_n that describes the relationship between each target word t_n and the clue c (e.g., "a car is a mode of transport").
To summarize, each turn from THE CLUE GIVER results in a clue c and at least one target-rationale pair (t_n, r_n). On the other hand, we collect the following for THE GUESSER.
Guesses: The guesses g_n that THE GUESSER selected for THE CLUE GIVER's clue c.
Rationale for Each Guess: A free-text phrase r_n that relates the guess g_n to the clue c.
Manual inspection revealed a wide range of rationales. To prevent models from exploiting this surface variance, we instructed GPT-3 to normalize the text, removing pronouns and determiners.² We provided few-shot examples of reformatted rationales and manually inspected the normalized outputs. Additional preprocessing information can be found in Appendix B.

Sociocultural Priors and Worker Diversity
Because we aim to understand the role of sociocultural priors on gameplay, we asked Turkers to complete the standardized surveys below, which cover three broad dimensions: demography, personality, and morality.
Demographic Data (Figure 2) comes from both the annotation UI and the task's qualifying questionnaires. In the UI, we asked Turkers for their numeric age, their country of origin, and whether English is their native language. These were required features, so we denote them as Demo_Req. In the qualifier, we included an extended demographic survey with age range, level of education, marital status, and native language (Appendix D.2.1), which we denote as Demo_All. We find that our annotator demographics are moderately diverse, mirroring Moss et al. (2020). Reported gender across annotators is roughly even: 53% identify as women, 47% as men, and 0% as other. Additional details are in Figure 2 and Appendix D.2.1.

Personality (Figure 3) surveys also offer insight into interpersonal interactions. We administer the Big 5 Personality Test (John et al., 1991), measuring a range of personality dimensions on a 5-point scale.

Moral and Political Leaning (Figure 4) also influences decision-making processes. Therefore, we asked annotators to self-report their political leaning (liberal, conservative, libertarian, etc.). While political leaning captures broad elements of annotator values, Haidt and Graham (2007)'s Moral Foundations Theory offers a finer-grained measure of moral values.

Table 1: Tasks associated with a turn in Codenames. THE CLUE GIVER starts by selecting information to encode (in the form of a clue), and THE GUESSER decodes clues through guesses. In our experiments, we evaluate models with and without sociocultural priors. Task formulation (generation/classification) is underlined.

General Dataset Statistics
In total, we collect 794 games, with a total of 199 wins and 595 losses. Games lasted an average of 9.7 turns, resulting in 7,703 total turns across all games. THE CLUE GIVER targeted an average of 1.24 words per turn. For all collected games, both players provided Demo_Req. For 54% of games, both players completed all background surveys; for the remaining 46%, at least one player completed all surveys. No game is missing background information entirely.

Tasks and Modeling
To investigate the role of sociocultural factors in pragmatic inference, we propose a set of tasks (Table 1) associated with THE CLUE GIVER (§4.1) and THE GUESSER (§4.2) roles. Concretely, we formalize each action as a conditional generation problem instead of classification, since outputs in CULTURAL CODES are unconstrained: actions and outputs depend on a changing board state. (Some players went inactive before a game was completed. We only collect games that are reasonably long: greater than the 90th percentile of incomplete games, or ≥ 7 turns.)

Selecting Target Words
To start, THE CLUE GIVER identifies target word(s) (1) on the board, which are later used to construct the clue. Clues will target salient words, where salience is at least partially determined by the speaker's cultural background (Wolff and Holmes, 2011). Each set of targets is a subset of the remaining goal words for a given turn (targets ⊆ goal); we enforce this restriction in our annotation UI.

Giving a Clue
After selecting target words, THE CLUE GIVER must generate a common clue word across the targets (2). Here, THE CLUE GIVER must select a prototypical word across the targets. Because cultural background plays a role in inference (Thomas, 1983), a clue should lie in the players' common ground. Furthermore, the clue word should not lead the guesser to pick an avoid word n_i or a neutral word e_i, since these words can end the game or turn (see §3.1). Therefore, we also include avoid and remaining neutral words in our input.

Framing the Target Rationales
The relationship between the target and clue word plays a critical role in communication: how information is framed with respect to common ground can influence pragmatic success (Crawford et al., 2017). To this end, we model THE CLUE GIVER's framing of the rationale r for a specific target word t (3), connecting the target t to the clue (cf. §3.3). Because the framing is constructed in relation to every target word (if multiple are provided), we also encode all targets in the input.

Selected Guesses
With the clue word, THE GUESSER pragmatically infers THE CLUE GIVER's targets, selecting a sequence of corresponding guesses (4). For this task, we model the sequence of all selected guesses, regardless of correctness. We input all unselected words at the start of each turn for THE GUESSER, along with the provided clue. (Note that goal/avoid/neutral words differ across players: a goal word for one player can be avoid for another; game states are asymmetric. A clue from THE CLUE GIVER may also target a goal word for THE GUESSER. As long as one does not guess an avoid word from the opposing player, the game continues; see §3.1.) Like with Target Word Selection, guesses must be a subset of the unselected words (guesses ⊆ unselected); we enforce this during annotation.

Table 2: ROUGE scores and fastText cosine similarities between the reference and generation, since outputs must be semantically close to or exactly match the reference labels. We find that Morality and All maximize performance over our metrics.

Framing Guess Choice
Finally, THE GUESSER also provides a rationale for each of their guesses, framing the clue with respect to the guess (5).

Predicting Pragmatic Success
So far, our tasks focus on replicating elements of a game turn: the Selected Guesses task (§4.2.1), for example, models both incorrect and correct guesses. However, we also wish to understand whether an entire turn sequence results in a successful inference; differences in cross-cultural inference can result in pragmatic failures (Thomas, 1983). We formulate this as binary classification. Importantly, we only consider a guess correct if it is intentional. A guess is intentional if and only if the clue giver listed it as a target. If THE GUESSER selects a goal word that is not a target word, we count it as "incorrect." Like with guess generation, we encode unselected words in the input. Because we are not predicting the guess itself, we include the target and rationale from THE CLUE GIVER.
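The intentionality rule can be written as a small predicate (a sketch; the function name and signature are ours, not the paper's):

```python
def label_guess(guess: str, targets: set[str], goal_words: set[str]) -> str:
    """A guess counts as correct only if it was an intended target.

    A goal word that was not targeted this turn is still labeled incorrect:
    the guesser got lucky rather than inferring the giver's intent.
    """
    if guess in targets:
        return "correct"
    if guess in goal_words:
        return "incorrect"   # lucky goal hit, but unintentional
    return "incorrect"       # neutral or avoid word

# Clue giver targeted {"fall", "drop"}; "well" is a goal word but not a target.
assert label_guess("fall", {"fall", "drop"}, {"fall", "drop", "well"}) == "correct"
assert label_guess("well", {"fall", "drop"}, {"fall", "drop", "well"}) == "incorrect"
```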

Augmenting with Sociocultural Priors
We hypothesize that players' backgrounds influence Codenames gameplay. To this end, we encode background player information for each task. For each dimension described in §3.4, we encode an attribute/answer pair (e.g. age: 22) for each survey question. Then, we prepend all attributes to the encoded strings for each outlined task ( §4), using a unique token to delimit attributes for THE CLUE GIVER and THE GUESSER.
in_socio = {BOS, GIVER, Clue Giver Attr_{1:A}, GUESSER, Guesser Attr_{1:A}} + in

If a player did not respond to a specific attribute, we replace the attribute/answer pair with None. From our sociocultural priors (§3.4), we have 5 ablations: Demo_Req, Demo_All, Personality, Morality, and All (concatenating and modeling all ablations). We additionally use no priors as a baseline, using in instead of in_socio to test our hypothesis.
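A minimal sketch of this input construction, assuming illustrative role tokens (`<BOS>`, `<GIVER>`, `<GUESSER>`) and attribute names; the dataset's actual delimiter tokens may differ:

```python
def encode_with_priors(giver_attrs, guesser_attrs, task_input, attr_keys):
    """Prepend attribute:answer pairs for both players, delimited by role tokens.

    Missing answers are replaced with "None", mirroring the ablation setup.
    """
    def render(attrs):
        return " ".join(f"{k}: {attrs.get(k) or 'None'}" for k in attr_keys)
    return (f"<BOS> <GIVER> {render(giver_attrs)} "
            f"<GUESSER> {render(guesser_attrs)} {task_input}")

encoded = encode_with_priors(
    giver_attrs={"age": 22, "country": "US"},
    guesser_attrs={"age": 35},                  # country not reported -> None
    task_input="clue: slip | unselected: fall drop receipt check",
    attr_keys=["age", "country"],
)
```

The no-prior baseline would pass `task_input` to the model directly, with no attribute prefix.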

Experiment Setup
Baselines and Dataset Splits For generation baselines, we use two Seq2Seq models: T5 (Raffel et al., 2020) and BART (Lewis et al., 2020). We optimize the associated language modeling objective across our tasks. Additionally, we experiment with two retrieval baselines for all generation tasks: (1) randomly selecting a generation from the train set and (2) retrieving nearest words using pretrained fastText word vectors. For each task, we split clue givers into 80-10-10 train/val/test, since all tasks depend on initial clue giver choices. Importantly, a single clue giver's data is not distributed across splits, since clue givers may reuse clues/strategies.
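The clue-giver-disjoint split can be sketched with the standard library alone (the `giver_id` key is a hypothetical identifier for each clue giver; in practice a library utility such as scikit-learn's GroupShuffleSplit serves the same purpose):

```python
import random

def split_by_giver(games, train=0.8, val=0.1, seed=0):
    """80-10-10 split that keeps all of one clue giver's games in a single split."""
    givers = sorted({g["giver_id"] for g in games})
    random.Random(seed).shuffle(givers)
    n_train = int(train * len(givers))
    n_val = int(val * len(givers))
    fold = {}
    for i, giver in enumerate(givers):
        fold[giver] = "train" if i < n_train else "val" if i < n_train + n_val else "test"
    return {s: [g for g in games if fold[g["giver_id"]] == s]
            for s in ("train", "val", "test")}

# Toy data: 20 clue givers, 10 games each.
games = [{"giver_id": i % 20, "turn": t} for i in range(100) for t in range(2)]
splits = split_by_giver(games)
```

Splitting by giver (rather than by game or turn) prevents a model from memorizing an individual's reused clues and strategies across splits.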

Generation Results & Discussion
Including cultural priors improves modeling performance across all tasks. For generation problems, T5 generally outperforms BART, and our retrieval baselines lag behind more complex models. Finally, we conduct a qualitative analysis of 20 random samples from each task.

Picking Targets and Guesses From our results (Table 2), we find that selecting guesses is an easier modeling task than picking target words, likely because the input for selecting a guess contains the clue word. Intuitively, selecting target words is more arbitrary than selecting a guess from a clue, especially since our generation task does not enforce guess correctness. Our models reflect this observation. Guess Selection has R-1 scores that are, on average, twice as good as Target Word Selection (Target 34 vs. Guess 66). Furthermore, Guess Selection only requires demographics (Demo_Req) to maximize performance, unlike Morality for Target Words. Regardless, both tasks see R-1 increase by ≈ 2 points over no-prior baselines.

Looking at model outputs between the None and Morality settings, we observe that models generate words like Well/Grace instead of Death/Poison and vice versa, depending on player background.
Generating a Clue for Targets Moving to our clue generation models, we again find that including sociocultural priors improves model performance (Table 3). Highest R-1 scores (26.54) occur when using Morality as a prior, resulting in a ≈ 2 pt. R-1 and 4 pt. cos-similarity increase when compared to a no prior baseline. We also suspect that selecting target words and generating a hint are interrelated processes: annotators are likely thinking about clues/targets in parallel. Therefore, the same Morality prior results in maximized performance.
While there are themes related to Morality in clue differences for a target word (accident → death vs. lucifer; or fair → equal vs. good), we also find that generations are more specific given sociocultural priors. Consider the generated target → clue pairs with (✓) and without (✗) priors in Appendix C, Table 6: each ✓ example generates a clue that relies on shared cultural background, specifically, knowing that cricket is a sport; that James Bond is a popular character; and that the Undertaker is a wrestler.

Clue Generation Errors Across Sociocultural Subtypes Despite jointly modeling cross-cultural information, our performance is far from perfect. Generating successful clues is a core element of Codenames; however, our exact match accuracy on clue generation is only ≈ 26%. To understand errors, we sample 100 generated clues from the Clue Generation task, and identify errors and differences between (socioculturally) generated clues and the ground-truth labels.
For 43 samples, we notice that sociocultural priors have no effect on clue generation; the output is identical to the no prior model for the given target word. In these instances, we suspect that our models fail to exploit common ground between a giver/guesser, yielding the same clue as without sociocultural priors. Upon further analysis, we observe that these errors occur frequently (37 samples) when both the clue giver and guesser are white or from North America. Because these demographics are already over-represented in our dataset, we suspect that the model simply ignores over-informative sociocultural priors.
Errors also occur because clues are overspecified (20 instances, e.g. "guevera" instead of "overthrow") or underspecified (13 instances, e.g. "supernatural" instead of "monster") compared to the gold clue. In 21 of these 33 instances, there is a demographic mismatch between the clue giver and guesser: the two do not share race/country demographics. In contrast to having no effect, we suspect that models mispredict the common ground between guesser and giver. We also judge 18 generation errors to be of similar specificity to the target word (prefixes/suffixes of the gold label), and 6 instances to be completely unrelated to the gold clue.

Rationalizing Targets and Guesses
Beyond generating target words and guesses, we ask models to explain how a target or guess is related to a clue word (e.g. James Bond is a movie character). Again, we find that providing contextual priors improves performance (Table 4). For Target Rationale Generation, models see maximized performance when all priors are included, while Guess Rationale generation sees improvements for Morality.
Like with Clue Generation, we find that improvements in Guess Rationale are from increased specificity (e.g. "actors are cast" → "actors are part of a cast"; "money is center" → "money is the center of everything"). While qualitative differences are clear for Guess Rationale, Target Rationale results are more subtle: improvements stem from minor variations in the type of framing ("a kind of" vs. "a type of") used by the annotator. Additional generations can be found in Appendix C, Table 7.
Classifying Pragmatic Failure We find that classification performance across each architecture is maximized when using sociocultural priors during training (Table 5). While BERT sees reduced improvement (an increase of only +0.02 F-1 over a no-prior baseline), XLNet and RoBERTa see maximum increases of +0.07 and +0.10 respectively. Both XLNet and RoBERTa see these improvements across the same Personality setting. Sociocultural priors improve performance across mirroring and evaluating pragmatic inference.
A Word on Word Vector Baselines Surprisingly, retrieving nearest words using a word vector approach (fastText) performs poorly for both Clue and Guess Generation (Tables 2 & 3). We suspect that pretrained vectors fail to capture sociocultural inference in word association tasks.

Conclusion
Language is grounded in rich sociocultural context. To underscore this context, we propose a setting that captures the diversity of pragmatic inference across sociocultural backgrounds. With our Codenames Duet dataset (7K turns across 153 players), we operationalize cross-cultural pragmatic inference. Across our experiments, we detail improvements in mirroring/evaluating inferences when using sociocultural priors. Our work highlights how integrating these priors can align models toward more socially relevant behavior.

Cross-Cultural Inference Beyond Codenames
Our work explores sociocultural pragmatic inference in a very limited setting, using a core vocabulary of just 100 words. Despite this limitation, we find significant diversity in our dataset; furthermore, our models successfully capture these diverse inferences. While a limitation of our work is its focus on a single setting, we expect domains outside of Codenames to see similar variance. Understanding and highlighting miscommunication in dialogue, due to culture-dependent misinterpretation, is one such extension. These domains are likely much noisier than Codenames; we urge future work to further investigate them.
Spurious Correlations across Sociocultural Factors Across all tasks but one (Target Rationale Generation §4.1.3), jointly modeling all sociocultural priors does not result in the highest performing model. Because our sociocultural factors already correlate with each other ( §3.4), we suspect that modeling all features may be redundant, adding spurious correlations and resulting in overfitting. Improved modeling methodology and careful regularization may address these issues; we leave these experiments for future work.

Bigger Models and Task Specific Modeling
Currently, we evaluate small Seq2Seq models due to computational constraints; however, evaluation of zero-shot and few-shot performance on larger language models (e.g. GPT-3) is necessary. Given the changing state of the Codenames board, along with evidence that LLMs struggle with theory-of-mind-like perspective taking (Sap et al., 2022), our dataset can serve as a challenging benchmark for sociocultural understanding. However, successfully encoding game state into prompts for LLMs may require experimentation. Finally, our current task formulation and modeling setup are straightforward: we simply encode all information in-context and do not assume recursive reasoning as in RSA (Goodman and Frank, 2016). Future work can explore these directions.
Human Evaluations Our evaluation is limited to automatic metrics and qualitative analysis. Evaluating cross-cultural generation depends on the evaluator's own culture: each generation depends on the player's sociocultural background, and finding evaluators who match the player may be prohibitive.

Ethics
Broadly, our work models user background to determine the choices they make. While we focus on a fairly harmless setting (Codenames), our operationalization can be used in harmful ways (e.g. tracking and modeling user behavior without consent). Future work that uses sociocultural information should only be applied to settings where there is no foreseeable harm to end-users.
Furthermore, learning sociocultural associations can introduce positive and negative stereotypes; documenting and reducing harmful stereotypes is an important avenue for future work. Finally, we emphasize that our work is not evidence for linguistic determinism: sociocultural variation in language can influence but not determine thought.

A Finalized Codenames Word List
We sample from the following list of 100 words: luck, grace, soul, fair, life, pass, revolution, change, charge, degree, force, code, genius, compound, time, wake, plot, draft, ghost, play, part, spell, well, point, link, mass, disease, sub, state, alien, space, mine, ray, millionaire, agent, bond, unicorn, figure, war, cycle, boom, sound, trip, centaur, death, club, crash, angel, cold, center, spring, round, date, press, cast, day, row, wind, fighter, embassy, beat, leprechaun, comic, pitch, mount, march, fall, undertaker, green, switch, strike, king, superhero, capital, slip, lead, check, lap, mammoth, air, match, spy, roulette, contract, witch, stock, light, drop, spot, novel, vacuum, cover, scientist, tag, conductor, field, racket, poison, ninja, opera.

B Reformatting Rationales using GPT-3

Some annotators wrote verbose rationales (I think fall happens after you slip), while other annotators were more succinct (fall after slip). To prevent models from learning grammar variation across annotators, we normalize our text using GPT-3. We use the following prompt, with hand-written few-shot examples. Some of the examples are unchanged; we include them in the prompt to demonstrate positive examples to the model.
Normalize the text, removing determiners like "the" and "a" at the start of a sentence, along with any pronouns. Correct spelling and grammar mistakes. If possible, the final text should be formatted with the clue first and the target last or the target first and the clue last.
clue: "sub" target: "sandwich" text: "you can make a sub, which is a type of sanwich" output: "sub is a type of sandwich"
clue: "die" target: "cliff" text: "you may die if you fall off a cliff" output: "die if fall off a cliff"
clue: "explosion" target: "boom" text: "it makes sound" output: "explosion makes boom"
clue: "superman" target: "superhero" text: "most famous superhero" output: "superman is most famous superhero"
clue: "night" target: "club" text: "i love night club" output: "night club is a kind of club"
clue: "horn" target: "air" text: "an air horn is a type of horn" output: "air horn is a type of horn"
clue: "ivy" target: "poison" text: "poison ivy is a well known plant" output: "poison ivy is a well known plant"
clue: "month" target: "march" text: "march is a month" output: "march is a month"
clue: "{clue}" target: "{target}" text: "{text}" output: "
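Assembling this prompt for a new (clue, target, text) triple can be sketched as follows (the example list is abbreviated to two of the eight shots above):

```python
INSTRUCTION = (
    'Normalize the text, removing determiners like "the" and "a" at the start '
    "of a sentence, along with any pronouns. Correct spelling and grammar mistakes."
)

FEW_SHOT = [  # abbreviated; the full prompt uses eight hand-written examples
    ("sub", "sandwich", "you can make a sub, which is a type of sanwich",
     "sub is a type of sandwich"),
    ("month", "march", "march is a month", "march is a month"),
]

def build_prompt(clue: str, target: str, text: str) -> str:
    """Render the few-shot normalization prompt, ending at the open output slot."""
    shots = "\n".join(
        f'clue: "{c}" target: "{t}" text: "{x}" output: "{o}"'
        for c, t, x, o in FEW_SHOT
    )
    return f'{INSTRUCTION}\n{shots}\nclue: "{clue}" target: "{target}" text: "{text}" output: "'

prompt = build_prompt("fall", "slip", "I think fall happens after you slip")
```

The trailing open quote leaves the `output:` slot for the model to complete.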

C Example Generations
Here, we include example generations for a subset of our tasks, illustrating the influence of sociocultural factors on generated Codenames gameplay.

C.1 Clue Generation
Below, we highlight more clues generated with/without sociocultural priors. Note how some of the without generations are Euro-centric: space → nasa, {revolution, king} → war; adding priors creates more specific clues. However, this isn't always true: the target words {pass, check} lead to poker instead of overtake when conditioned on priors. We suspect that the average player in our pool is not aware of how {pass, check} are associated with poker, resulting in a more generic generation.

Residual rows from Table 7 (word pairs with example rationales): tennis, racket: "tennis has racket"; "a racket is used in tennis"; "tennis uses a racket". day, month: "day is month"; "month has many days"; "30 days in a month".

Table 7: Example Rationales for Clues, with/without background priors. With priors, we observe that rationales become more specific, mentioning explicit relations between the target and clue.

ACL 2023 Responsible NLP Checklist

A For every submission:
A1. Did you describe the limitations of your work? Section 8
A2. Did you discuss any potential risks of your work? Section 9
A3. Do the abstract and introduction summarize the paper's main claims? Abstract + Introduction
A4. Have you used AI writing assistants when working on this paper? Left blank.

B Did you use or create scientific artifacts? Section 3 for our introduced dataset, and we cite all baseline models (Section 5)
B1. Did you cite the creators of artifacts you used? Section 5
B2. Did you discuss the license or terms for use and/or distribution of any artifacts? Appendix Section E
B3. Did you discuss if your use of existing artifact(s) was consistent with their intended use, provided that it was specified? For the artifacts you create, do you specify intended use and whether that is compatible with the original access conditions (in particular, derivatives of data accessed for research purposes should not be used outside of research contexts)? Yes, Section 9
B4. Did you discuss the steps taken to check whether the data that was collected/used contains any information that names or uniquely identifies individual people or offensive content, and the steps taken to protect/anonymize it? Appendix E

C Did you run computational experiments? Yes, Section 5
C1. Did you report the number of parameters in the models used, the total computational budget (e.g., GPU hours), and computing infrastructure used? Appendix D and E

The Responsible NLP Checklist used at ACL 2023 is adopted from NAACL 2022, with the addition of a question on AI writing assistance.