Interactive-Chain-Prompting: Ambiguity Resolution for Crosslingual Conditional Generation with Interaction

Crosslingual conditional generation (e.g., machine translation) has long enjoyed the benefits of scaling. Nonetheless, there are still issues that scale alone may not overcome. For instance, in the absence of additional context, a source query in one language may yield several translation options in another language. Only one translation may be acceptable, however, depending on the translator's preferences and goals. Choosing the incorrect option can significantly affect translation usefulness and quality. We propose a novel method, interactive-chain prompting, a series of question, answering and generation intermediate steps between a Translator model and a User model, that reduces translation into a list of subproblems addressing ambiguities and then resolves these subproblems before producing the final translated text. To check ambiguity resolution capabilities and evaluate translation quality, we create a dataset exhibiting different linguistic phenomena that lead to ambiguities at inference time for four languages. To encourage further exploration in this direction, we release all datasets. We note that interactive-chain prompting, using eight interactions as exemplars, consistently surpasses prompt-based methods with direct access to background information to resolve ambiguities.


Introduction
Transformer Language Models (LM, Vaswani et al. 2017) pretrained on large corpora have achieved outstanding results in a variety of NLP benchmarks (Devlin et al., 2019; Brown et al., 2020). However, for tasks such as commonsense and symbolic reasoning, where the solution requires multistep computation, or crosslingual conditional generation such as Neural Machine Translation (NMT), where there can be more than one plausible prediction for a given source sequence, scaling the number of parameters alone may not be sufficient to achieve high accuracy (Rae et al., 2021; Ghorbani et al., 2022).

(Presented at the Interactive Learning with Implicit Human Feedback Workshop at the International Conference on Machine Learning (ICML) 2023, Honolulu, Hawaii, USA. Copyright 2023 by the author(s).)
Chain-of-thought (Wei et al., 2022b) and least-to-most (Zhou et al., 2022) methods have demonstrated, by prompting a (large-)LM such as PaLM (Chowdhery et al., 2022), that breaking down a task into subproblems that are solved sequentially greatly improves the quality of the final prediction. Such methods demonstrate that producing intermediate sub-results that address specific aspects of a bigger problem significantly improves performance on tasks like arithmetic, math word problems, and symbolic manipulation. While studies have investigated the translation capabilities of PaLM with various prompting strategies (Vilar et al., 2022; Zhang et al., 2023), prompting large and general-purpose LMs such as PaLM to identify and solve subproblems in crosslingual conditional generation tasks such as NMT has not yet been fully explored.
Our approach, Interactive-Chain-Prompting (INTERCPT), sequentially solves translation subproblems before generating a final translation prediction. As shown in Figure 1, we first detect ambiguities in translation queries, then we resolve these ambiguities via question-answer interactions, and finally we generate translations. INTERCPT departs from other prompt-based techniques that sequentially solve subproblems in two fundamental ways: (1) the subproblems are related but considerably different from the main task, and (2) the solutions to subproblems require interaction with another LLM. In this paper, we look at how intermediate computation steps and interaction can help overcome a typical problem in automated systems: a user's ambiguous query leading to a large number of viable and potentially inaccurate answers. In translation, for example, selecting the incorrect prediction has a significant impact on translation quality, as illustrated in Fig. 2.
INTERCPT has several advantages. First, the LM is able to identify and ask questions about translation query ambiguities with only a few in-context exemplars and no finetuning. This is crucial since large corpora with specific target ambiguities, labels classifying each ambiguity subtype (i.e., feminine/masculine for gender or formal/informal for formality) and context are not common and are typically low-resource. Then, without readily available context, we rely on the User to disambiguate translation queries. In the absence of additional background information or context, there are limited options to resolve ambiguities, and interaction with the User stands as a logical way to collect clarifying information. This interaction also benefits from multiple computation steps where ambiguity resolution leads to a more precise final prediction. Finally, the question-answer-translation interaction improves transparency and makes it easier to debug translation systems since we can assess the reasoning chain that led to an error (Wu et al., 2022a).

For NMT, there are two main questions to consider to make the most of intermediate computation steps:

A) What subproblem are we trying to solve? Multistep reasoning tasks can often be explicitly decomposed into subproblems; for NMT, decomposing the translation task is not trivial. We assume in this work that our subproblems are the ambiguities which arise when translating, which INTERCPT addresses through ambiguity detection, disambiguation via Q&A, and translation. As seen in Fig. 1, the first step in INTERCPT is to discover and resolve the translation ambiguity subproblem. We study five types of ambiguities: polysemous words, pronoun resolution, formality, gender-neutral names and neutral professions. Since datasets that cover multiple translation ambiguities and language pairs while providing context are rare, we create our own datasets (see Table 5 in Section C for an overview of other publicly available datasets).

B) Where do answers to subquestions come from? When we apply least-to-most prompting to math word problems, for example, the answers to subquestions can often be derived from the problem's text. That is not necessarily the case for NMT, where the query may not contain enough context to resolve ambiguities. As seen in Fig. 2, English sentence 'S' does not contain enough information about "you" and "it". The incorrect prediction made by a model leads to large variations in translation quality scores. With more context, the model may have the necessary information to narrow down possible predictions. However, in industrial applications, translation queries are often too short (Badeka, 2016) or additional context does not exist. In this work, we automate interaction between a PaLM Translator model, which detects ambiguities, asks clarifying questions and translates, and a PaLM User model, which has access to context and answers questions. Both models engage in a multiturn dialog to zero in on a narrower set of predictions. We argue that this type of question-answer interaction with a "user" is necessary to resolve ambiguous queries, especially when a user (1) is unfamiliar with the main task and may not possess the in-domain knowledge to choose from many model prediction options; or (2) knows how to answer simple pointed questions about a query but may not be able or willing to decide on and add appropriate context on the fly. This work highlights large LMs' potential to leverage a few in-context examples to provide natural language answers and deliver results closer to a user's intent.

Interactive-Chain-Prompting (INTERCPT)
When interacting with a model, a user may have some well-conceived query in mind that is inadvertently underspecified. For example, a monolingual English speaker may be unaware that the pronoun "you" in a sentence can lead to formal or informal constructs in other languages. The model may therefore not receive the additional information on the level of formality needed to adequately translate the text for this particular user.
A human translator, when asked to translate queries with "you", may want to first probe the user's latent context about the query by asking clarifying questions. In doing so, the human translator can use the answers to better align the translation to the user's request and context. Our method endows language models (LMs) with the ability to generate a similar chain of interactions between a Translator LM and a User LM, as seen in Fig. 1. In real applications, it is expected that a human replaces the User LM. INTERCPT uses in-context exemplars to resolve ambiguities before completing the crosslingual conditional generation task that the model is originally asked to do.
It consists of a three-step reasoning chain (see Fig. 1) with demonstrations that remain constant for each input query:

1. The first step identifies ambiguities. The prompt in this step contains the exemplars, showing multiple queries to translate and questions about each query's ambiguities. During inference, the Translator LM uses the prompt to generate a pointed question that identifies the specific ambiguity.

2. The second step resolves ambiguities. The prompt in this step contains exemplars answering the questions to the ambiguity subproblems in step one. The User LM answers each question using additional information from the provided context. In real-life applications, we assume that a real user has similar background information about the text to be translated.

3. The third step translates. Generated questions and answers are appended to the prompt in step 1 before the final translation is produced. Constant prompts in this step demonstrate how to translate into the specified target language using only details provided by the User LM and no context. During inference, the Translator LM uses the prompt to generate the translation.
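The three steps above can be sketched as a small driver loop. This is a minimal sketch, not the paper's implementation: the FEWSHOT_* prefixes stand in for the actual few-shot exemplar templates, and the translator_lm/user_lm callables stand in for prompted PaLM models.

```python
# Minimal sketch of the INTERCPT three-step chain. The FEWSHOT_* prefixes
# and the injected translator_lm/user_lm callables are illustrative
# assumptions; in the paper both roles are prompted PaLM models.

FEWSHOT_ASK = "Identify the ambiguity in (S) and ask a Question (Q):\n"
FEWSHOT_ANSWER = "Given a Context (C), provide an Answer (A) to the Question (Q):\n"
FEWSHOT_TRANSLATE = "Using (Q) and (A), translate (S) into the target language:\n"

def intercpt(query, context, translator_lm, user_lm):
    # Step 1: the Translator LM generates a pointed clarifying question.
    question = translator_lm(FEWSHOT_ASK + f"S: {query}\nQ:")
    # Step 2: the User LM answers using context the Translator never sees.
    answer = user_lm(FEWSHOT_ANSWER + f"S: {query}\nC: {context}\nQ: {question}\nA:")
    # Step 3: Q and A are appended to the step-1 query before translating.
    return translator_lm(FEWSHOT_TRANSLATE + f"S: {query}\nQ: {question}\nA: {answer}\n")
```

At deployment time, a human can replace the User LM simply by passing a callable that collects an answer interactively instead of querying a model.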

Ambiguity MT Datasets (AMBIGMT)
In this section, we introduce AMBIGMT, a dataset that covers four language pairs, for translations from English into French (en-fr), German (en-de), Spanish (en-es) or Japanese (en-ja), for 18 sub-tasks in total. The parallel translation corpora contain five types of ambiguities: "it" resolution, formality, polysemy, gender-neutral names and neutral professions. Unless otherwise specified, all datasets include 1000 diverse samples for each {en-fr, en-de, en-es, en-ja} language pair extracted from the OpenSubtitles corpora (Lison & Tiedemann, 2016). Please note that, due to the lack of large translation corpora with various genders and the complexity of creating non-binary gender datasets, our data is limited to feminine and masculine. In Section C of the Appendix, we provide more details on the datasets and describe the heuristics used to identify ambiguities in each language.
"it" resolution data contains English sentences where the pronoun "it" does not clearly refer to a noun within the query.In English, the pronoun "it" is a singular, neuter and impersonal pronoun.In other languages, "it" may translate into gender specific pronouns (either feminine or masculine) or get dropped entirely from the sentence.The choice depends on what the pronoun refers to.To correctly translate, the model must first determine what "it" is.In the first example of Table 1 where the target language x is Spanish, knowing that "it" is a postcard, or una tarjeta postal in Spanish, disambiguates gender in the translation.While the gender affects two words in the target sentence, the wrong gender choice is not only qualitatively inappropriate but also decreases quality metrics (44 BLEU score drop from 100).
Polysemy is a dataset that contains words with multiple meanings where the query is insufficiently informative to zero in on a specific sense. The context uses the word within a sentence to provide the necessary background information. In the second example of Table 1 where the target language x is Japanese, the context shows that "head" is a verb. In conjunction with the noun "home", we disambiguate "head" as "to move in the direction of". In the absence of such context, "head" has various senses such as "upper part of the body", "side of a coin", "end of a hammer or tool", "a toilet on a boat", "to hit the ball with the head", "to lead".
Formality is a dataset where English queries contain the pronoun "you". In the target languages studied, "you" can be formal or informal. As seen in the third example of Table 1 where the target language x is French, the speaker addresses the listener "you" as "Master Jedi" in the context, a title implying a formal style of politeness. The formality is ambiguous without the context and may impact the generated translation quality. Indeed, an incorrect choice of formality level changes "vous serez" to "tu seras" and "cela" to "ça", decreasing BLEU scores by 58 points from 100.
Gender Neutral Names data includes queries where a name is gender neutral and ambiguous. The fourth example in Table 1 shows a query where the name "Blair" is gender neutral. In this dataset, we replace gendered pronouns in the English query by the token [pr] to remove hints about gender. From the context, the speaker employs "her", so we can infer that a feminine pronoun "ihr" should be used in the translated German text.
Neutral Professions has 600 unique samples for two language pairs. This dataset is derived from the Translated Wikipedia Biographies dataset that covers {en-de, en-es}. In this dataset, the gender of typically gender-neutral professional designations is not clear from the English query alone. In the fifth example of Table 1, the context provides additional hints that the query is talking about "Margeret", also designated by the feminine pronoun "she". Resolving gender allows the model to correctly translate the list of professions in the query, potentially limiting the 70-point drop in BLEU scores from 100.

Experimental Setup and Results
In this section, we present the main cross-lingual generation results of INTERCPT for formality, "it" resolution and polysemy ambiguity resolution subtasks.
Setup. We use PaLM (Chowdhery et al., 2022), a 540B-parameter decoder-only LM pretrained on primarily English-centric data, with ∼20% of the data obtained from non-parallel multilingual corpora. The generalist prompt template is composed of two formality, three polysemy and three "it" resolution exemplars. All prompt-based methods are 8-shot with the same source sentences S to translate and corresponding translated sentences A in the target language. Each target language has its own prompt template since A differs with every language. The simulated LM user is based on a single English-only 8-shot prompt template for all target languages. Example 4.1 shows the structure of a User LM prompt exemplar for polysemy. A complete overview of all prompts and exemplars used in experiments can be found in Section D.1 for the User LM and Section D.2 for the generalist Translator LM.
Example 4.1. Given a Context (C), provide an Answer (A) to the Question (Q):
S: about
C: About 2% of the households are enumerated using the canvasser method.
Q: Is "about" an adverb that means approximately, near or a preposition that means regarding, over, surrounding?
A: "about" means approximately.
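A few-shot User LM prompt in the shape of Example 4.1 can be assembled mechanically. The helper below is an illustrative sketch: the S/C/Q/A field markers follow the paper, but the function itself is our assumption, not the released code.

```python
# Sketch of few-shot prompt assembly for the simulated User LM. The
# instruction string mirrors Example 4.1; build_user_prompt is assumed.

USER_INSTRUCTION = "Given a Context (C), provide an Answer (A) to the Question (Q):"

def build_user_prompt(exemplars, query, context, question):
    """exemplars: list of (S, C, Q, A) tuples used as demonstrations."""
    shots = "\n\n".join(f"S: {s}\nC: {c}\nQ: {q}\nA: {a}" for s, c, q, a in exemplars)
    # The final block ends at "A:" so the model completes the answer.
    return f"{USER_INSTRUCTION}\n\n{shots}\n\nS: {query}\nC: {context}\nQ: {question}\nA:"
```

The same assembly shape works for the Translator-side templates, with the fields reordered for question generation or translation.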
Baselines. Our main baselines were chosen to compare the cross-lingual generation abilities of large multipurpose LMs given interaction, context or no additional information. Please note that, to the best of our knowledge, there are no other baselines that (1) explore large multipurpose LMs' capability on contextualized (or interactive) multilingual translation and (2) do not require finetuning on large datasets.
LLMWCXT, our strongest baseline, is the only PaLM-based prompt method that benefits from having all of the background information required to resolve ambiguities. LLMWCXT has a prompt with exemplars formulated as the one in Example 4.2. In the example, the references for "you" and "it" are directly accessible in context C.
LLMNOEXTRA is a PaLM-based prompt method that does not receive additional information to resolve ambiguities. This baseline is not only of interest for performance comparison and for evaluating model bias, but it can also provide insights on the usefulness of additional background information for disambiguating queries. The structure of an LLMNOEXTRA exemplar is similar to Example 4.2 without the context C. The model must translate the source sentence S into the target language without knowing details about "it" or the level of formality to employ for "you".
GTRANSLATE is a commercially available multilingual and multipurpose baseline queried using the Google Cloud Translation API. This baseline allows us to set performance expectations that the LLMNOEXTRA model should reach.
Example 4.2. Given context (C), Translate (S) from English to French:
S: Are you sure that it is pretty?
C: She was trying on a new hat. Looking at herself in the mirror, she asked her friend Isabelle.
A: Es-tu certaine qu'il est beau?

Metrics. Our evaluation includes the standard BLEU and BLEURT (Sellam et al., 2020) automatic translation quality metrics as well as additional measures that assess specific ambiguity resolution capabilities. For formality, we use a rule-based classifier to quantify generated sentence formality levels (F-Acc) in the target language. We discuss details of the heuristics in Appendix E. Note that the formality classifier is based on the formality data creation scripts that allowed us to automatically identify formal and informal sentences in the source corpus. For "it" resolution, we found that the PaLM 62B-parameter model was surprisingly accurate at identifying translated sentence genders (G-Acc). As seen in Table 7 of Appendix E, PaLM 62B achieves 97% and 93% accuracy in classifying samples of generated translations for Spanish and French respectively. For polysemy, we found that exact match metrics did not fully describe the performance of models: whenever the model generated a synonym of the ground truth, the exact match metric would not consider the prediction correct. The LLMNOEXTRA polysemy exemplars are a comma-separated list of synonyms. Our hit@n metric measures whether the ground truth exists in the first n generated words. For example, if the model outputs the list of Spanish words ["aproximadamente", "cerca de", "alrededor de", "casi", "más o menos"], hit@3 would return a match for a ground truth target "cerca de" and no match for a ground truth target "casi".

Table 2. Translation results using an 8-shot generalist template that contains exemplars for formality, "it" resolution and polysemy ambiguity types. F-Acc = formality accuracy, G-Acc = gender accuracy, B@n = BLEURT@n. BLEU and BLEURT results for INTERCPT labelled with † are significantly better than all other systems based on pair-wise significance testing (Koehn, 2004).
To supplement the hit@n metric, we also report results of a new metric that we call BLEURT@n (B@n), which returns the highest BLEURT score of the first n generated word phrases. Since BLEURT captures the non-trivial semantic similarities between words using its contextual representations from BERT, we found that the metric better measures whether correct synonyms were generated by the model. Note that we did not report the GTRANSLATE hit@n or B@n numbers since the API only provides single-word outputs.
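Both list metrics are straightforward to implement. The sketch below covers hit@n exactly as described, and shows the shape of BLEURT@n with the scorer injected as a callable (the function names are ours; a real BLEURT checkpoint would replace the toy scorer).

```python
def hit_at_n(candidates, reference, n):
    """True if the reference word phrase is among the first n candidates."""
    return reference in candidates[:n]

def bleurt_at_n(candidates, reference, n, bleurt_score):
    """Highest score over the first n candidates; bleurt_score is any
    callable mapping (candidate, reference) to a float, e.g. BLEURT."""
    return max(bleurt_score(c, reference) for c in candidates[:n])

# Worked example from the text: Spanish candidates for "about".
preds = ["aproximadamente", "cerca de", "alrededor de", "casi", "más o menos"]
```

With these candidates, hit@3 matches the target "cerca de" (second candidate) but not "casi" (fourth candidate), reproducing the example in the text.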
Discussion. Our test results for en-es, en-fr, en-de and en-ja are summarized in Table 2. We first notice that INTERCPT surpasses all other baselines. Surprisingly, LLMWCXT, even with all the necessary background to resolve ambiguities, significantly lags behind INTERCPT on F-Acc for formality, G-Acc for "it" resolution, and both hit@3 and B@3 for polysemy. This result suggests that the multistep computation approach of first resolving the ambiguity subproblems and then generating text has an advantage over the other baselines. BLEU scores are also 2-3 points higher while BLEURT scores are only slightly higher, suggesting that INTERCPT generates sentences syntactically much closer to the ground truth while conserving the correct semantics.

Figure 3. INTERCPT enables large LMs to solve ambiguity subproblems in cross-lingual generation. The multistep disambiguate-translate capability is an emergent ability that is reached at higher parameter scales. (x-axis: Translator Model scale, # parameters in billions.)

We study generalization to held-out ambiguities in Subsection 5.1. In Subsection 5.2, we see that templates with exemplars covering many ambiguities (i.e., generalist) perform on par with ones with single-ambiguity exemplars (i.e., specialist). In Subsection 5.3, we find that interactive translation is an ability that emerges with scale. Subsection 5.4 shows that User LM scale is important, with the best scores reached with the 540B model. We study our method's failure modes in Subsection 5.5, showing where we can improve INTERCPT further. Finally, we provide evidence that interaction helps better mitigate bias in Subsection 5.6.

How does interaction generalize?
In Table 3, we provide translation test results on two held-out datasets that are described in Section 3: (1) Gender Neutral Names and (2) Neutral Professions. We use the same generalist prompt template as in Section 4 with exemplars that cover only formality, "it" resolution and polysemy. Specifically, our exemplars for both the Translator LM and the User LM do not contain exemplars to resolve the gender of a person's name or profession. We observe that on the Gender Neutral Names dataset INTERCPT performs best on BLEU and BLEURT and is much more able to resolve ambiguities, with 6 to 10 point G-Acc improvements over LLMWCXT. On the Neutral Professions data, where test samples are taken from a different domain (Wikipedia biographies instead of movie scripts), LLMWCXT and INTERCPT have similar performance. It is possible that LLMWCXT benefits from additional sentences in the context to better determine the style of the output. Nonetheless, INTERCPT provides a 1-2 point increase in G-Acc.

Are specialist prompts better than generalist ones?
So far, we have studied a generalist 8-shot template covering three different types of ambiguities with at most three exemplars per ambiguity. In Fig. 4, we present results for a specialist template that only covers one type of ambiguity at a time (either all formality or all polysemy). Interestingly, specialization does not seem to provide much additional benefit in resolving ambiguities, as evidenced by F-Acc, hit@3 and B@3 results that are on par with and often lower than the generalist approach. However, the specialist template does have a higher BLEU score, implying greater syntactic alignment with the target translation when more ambiguity-specific exemplars are added.

Are interactive generation abilities emergent?
We show in Fig. 3 that the multistep disambiguate-translate capability of INTERCPT only appears at higher Translator LM parameter scales. We conjecture that the emergent behavior of INTERCPT is due to a better ability to ask questions and incorporate answers before generating the final prediction.

How important is User LM parameter scale?
While the User LM allows us to automate the evaluation of interactivity for cross-lingual generation, it is not clear whether the quality of the answers to the Translator LM's questions impacts performance. We hypothesize that a larger User LM would provide higher-quality answers and allow the Translator LM to better generate translated text. Fig. 5 shows that, when the Translator LM is a 62B PaLM model, a higher-parameter User LM improves overall performance. It is therefore possible that answer quality has a significant impact on translation quality and that human-generated answers can further improve overall performance.

When is context better than interaction?
In this section, we provide analysis that describes common areas of improvement for generalist interactive-chain prompting. We first isolated test samples for French and Spanish for four ambiguities (formality, "it" resolution, neutral professions and gender-neutral names) where the BLEURT scores were less than or equal to LLMWCXT scores. We then randomly sampled 50 interactions and manually analysed the interaction chains (query, question, context, answer, translation). This led us to five types of errors: (1) wrong question, when the Translator LM asked a question not related to the ambiguity; (2) wrong answer, when the User LM did not correctly disambiguate; (3) many ambiguities, when the query had multiple unresolved ambiguities or the User LM answer also contained ambiguities; (4) limited context, when the context was not sufficiently informative to resolve ambiguities; (5) style or other, when the generated translated text had discernible differences from the ground truth. Fig. 6 shows that the majority of errors come from wrong User LM answers for formality and "it" resolution, which partially confirms our hypothesis in Subsection 5.4. For tasks involving unseen ambiguities, the majority of errors come from the Translator LM, with 68% to 78% of sample chains having the wrong question or noticeable differences in generated translated text style or form. We provide examples of interaction chains for each type of error in Table 4.

Can interaction help solve NLG bias issues?
Gender bias is a common phenomenon in automated NMT systems (Borkan et al.).

In related work, another GPT-3 model simulates the user and generates answers while conditioned on ground-truth clarification questions. In contrast, our prompt-based method only needs few-shot demonstrations. Further, our simulated user does not rely on ground-truth clarification questions to provide an answer, which could be more realistic for a number of applications (including QA, text simplification, code generation).

Conclusion
We propose interactive-chain prompting (INTERCPT), a prompt-based interactive multistep computation technique that first resolves cross-lingual ambiguities in the input queries and then performs conditional text generation. We have created and released new datasets that cover five ambiguities (formality, "it" resolution, polysemy, gender-neutral names and neutral professions) for four different language pairs. Empirical results show that INTERCPT outperforms other prompt-based techniques that have access to all background information and context to directly resolve ambiguities. We find that INTERCPT MT is an emergent property of parameter scale that allows large LMs to perform interactive generation tasks while other prompt-based techniques exhibit flattening scaling curves. INTERCPT can be considered a step toward more efficiently interacting with machine learning systems.

A. More details on INTERCPT interactive steps
To link the interaction steps in Figure 1, the process overview in Section 2, and the appendix code and templates, we add the following:

Step 1: The Translator LM asks a question about the ambiguity using the language-specific methods in Appendix D.2. It takes as input the English text to translate, en_text, and outputs the question Q. For example, to translate English to Spanish with a generalist template, we can use spanish_generalist_translator_interactive(...).
Step 2: The User LM answers the question Q generated in step 1 using any method in Appendix D.1. It takes as input en_text and the context C (ctx in the code) and outputs the answer U. For example, we can use generalist_simulated_user_context(...).
Step 3: If no other ambiguity is detected, the Translator LM translates using the language-specific methods in Appendix D.2. It takes as input the English text to translate, en_text, the question Q, and the answer U, and outputs the translation A.

B. Link between Chain-of-Thought and Least-to-Most prompting
In this section, we add a few more words on the link between INTERCPT, Chain-of-Thought (CoT) and Least-to-Most (L2M) prompting. CoT performs better than the baseline that has access to the whole information in the problem statement (similar to having context). This behavior is attributed to the sequential solving of subproblems (in our case, ambiguity) and multistep computation (in our case, interaction). LLMWCXT has access to more information but does not involve multiple computation steps to solve a subproblem, while INTERCPT does.

C. More details on AMBIGMT ambiguity datasets
In this section, we provide additional information on what the datasets contain and how they were created.As mentioned in Section 1, to the best of our knowledge, datasets that cover a large set of ambiguities for multiple language pairs do not exist.
We provide an overview of publicly available datasets in Table 5. Upon manual inspection of samples from other public datasets, we found that translation queries were often (> 50%) unambiguous since the translation query contained enough information and did not need to rely on the provided context.We inspected 200 samples from AMBIGMT and found that only 3% of queries did not need context to disambiguate the linguistic phenomena.

C.1. Dataset statistics
We present in Table 6 the data statistics for AMBIGMT. For polysemy, the total senses per word is the number of different definitions or meanings found for a specific source English word. Each ambiguity is well balanced across the formal/informal or feminine/masculine classes. The Neutral Professions dataset is derived from the Translated Wikipedia Biographies dataset that only covers the {en-es, en-de} language pairs.

Table 6. AMBIGMT data statistics for each type of class and language pair. Form = formal, Inform = informal, Mas = Masculine, Fem = Feminine, res = resolution, Prof = Profession.

C.2. Ambiguity detection

In this section, we present the steps, tools and heuristics used to detect ambiguities. For polysemy, formality, "it" resolution and gender-neutral names, we extract the data from the OpenSubtitles corpora, and for neutral professions from the Translated Wikipedia Biographies. The source data consists of parallel sentence-level pairs. We first detect a sentence that has a specific ambiguity and extract the context by taking three to five preceding English sentences, depending on sentence size. For polysemy, the context is an English sentence that contains the polysemous word that will be translated. The code and datasets are released here.

C.2.1. POLYSEMY

We provide the following list of steps to create the polysemy dataset:

1. Build the list of polysemous English words:
• Create a list of English words.
• Compute the number of definitions per word without counting definitions with synonym overlap.
• Extract polysemous words (w e ) with more than three definitions and a word length greater than four.
2. For each polysemous English word w_e, extract a list l_x = {w_x1, ..., w_xN} of possible word translations using the Google Cloud Translation v2 API, where x ∈ {es, fr, de, ja} is the target language.

3. For each polysemous English word w_e and each target language x ∈ {es, fr, de, ja}:

• Find a sentence that contains the word w_e in the OpenSubtitles dataset.

• If the parallel sentence contains one of the translated words w_xi ∈ l_x from step 2 and no other translated word, keep the English sentence as context.
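The keep/discard decision in step 3 reduces to a membership count over the candidate translations. The helper below is a sketch with a toy Spanish example; the real pipeline runs over OpenSubtitles with API-produced candidates.

```python
def keep_as_ambiguous(parallel_sentence, candidate_translations):
    """Keep the English sentence when exactly one candidate translation of
    the polysemous word occurs in the parallel target sentence, so the
    target side pins down a single sense."""
    hits = [w for w in candidate_translations if w in parallel_sentence]
    return len(hits) == 1

# Toy check: illustrative Spanish candidates for the polysemous word "head".
candidates = ["cabeza", "dirigir", "cara"]
```

A sentence whose target side matches exactly one candidate is kept; a sentence matching zero or several candidates is discarded because the sense remains unresolved.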

C.2.2. FORMALITY
Each language has specific formality rules. For Japanese, we direct the reader to our public code: https://anonymous.4open.science/r/interactive_chain_prompting. We provide the following list of steps to create the formality dataset for Spanish, French and German:

1. Find a sentence that contains "you" or "your" and that has a word count of less than 20 in the English OpenSubtitles corpus.

2. Select parallel sentences for each target language x ∈ {es, fr, de, ja} that meet the following criteria.
3. If x == es, check the following in the parallel Spanish sentence (all checks are initialized to FALSE):

• If all verbs end in "s", "ste" or "os", then is_verb_informal = TRUE.

• If any pronoun is "usted", then is_pronoun_formal = TRUE.

• If any determinant is "su", then is_determinant_formal = TRUE.

• is_informal = is_verb_informal and is_pronoun_informal and is_determinant_informal.

• is_formal = is_pronoun_formal and is_determinant_formal.
(See the word-sense disambiguation example in https://www.nltk.org/howto/wsd.html.)

4. If x == fr, check the following in the parallel French sentence (all checks are initialized to FALSE):

• If any verb ends in "x", "s" or "ons", then is_verb_informal = TRUE.
• If any verbs finish by "ez", then is verb formal = TRUE.
• If one of the pronouns is "vous", then is pronoun formal = TRUE.
• If one of the pronouns is "tu", then is pronoun informal = TRUE.
• If one of the determinants is in ["vos","votre"], then is determinant formal = TRUE.
• is informal = is verb informal and is pronoun informal and is determinant informal.
• is formal = is verb formal and is pronoun formal and is determinant formal.
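As a rough sketch, the Spanish and French rule logic above might be implemented as below. The tokenized verb, pronoun and determiner lists are assumed to come from an upstream tagger (not shown), and flags with no stated rule stay at their FALSE initialization, exactly as in the steps above.

```python
def spanish_formality_flags(verbs, pronouns, determiners):
    """Spanish checks; flags without a stated rule stay False."""
    is_verb_informal = bool(verbs) and all(v.endswith(("s", "ste", "os")) for v in verbs)
    is_pronoun_formal = "usted" in pronouns
    is_determiner_formal = "su" in determiners
    is_pronoun_informal = is_determiner_informal = False  # no rule stated above
    is_informal = is_verb_informal and is_pronoun_informal and is_determiner_informal
    is_formal = is_pronoun_formal and is_determiner_formal
    return {"formal": is_formal, "informal": is_informal}

def french_formality_flags(verbs, pronouns, determiners):
    """French checks; the informal-determiner rule is not stated above."""
    is_verb_informal = any(v.endswith(("x", "s", "ons")) for v in verbs)
    is_verb_formal = any(v.endswith("ez") for v in verbs)
    is_pronoun_formal = "vous" in pronouns
    is_pronoun_informal = "tu" in pronouns
    is_determiner_formal = any(d in ("vos", "votre") for d in determiners)
    is_determiner_informal = False  # no rule stated above
    is_informal = is_verb_informal and is_pronoun_informal and is_determiner_informal
    is_formal = is_verb_formal and is_pronoun_formal and is_determiner_formal
    return {"formal": is_formal, "informal": is_informal}
```

Note that with the rules as written, the informal flags gate on checks that are never set to TRUE, so only the formal side can fire; this sketch reproduces that behavior rather than guessing at the elided informal rules.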
5. If x == de, check the following in the parallel German sentence (all checks are initialized to FALSE):
• If any pronoun is "Sie", then is pronoun formal = TRUE.
• If any pronoun is "du", then is pronoun informal = TRUE.
• is informal = is pronoun informal.
• is formal = is pronoun formal.

C.2.3. "IT" RESOLUTION

We provide the following list of steps to create the "it" resolution dataset. The steps apply to all languages:
1. For each English sentence in the OpenSubtitle dataset, keep sentences where the word "it" exists.
• Using a dependency parser, if "it" is expletive, skip the sample.
• In the parallel Spanish, French, German or Japanese sentence, if the sentence does not contain a verb and a gendered pronoun, skip the sample.
• Keep the gender label.
2. For each sample, create context by keeping the preceding three to five English sentences, depending on whether the word count is above 20.
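One plausible reading of the context step above, as a sketch: always keep the three preceding sentences, and extend to four or five only while the accumulated context stays at or below 20 words. The thresholds and the exact stopping rule are assumptions, not the paper's stated algorithm.

```python
def build_context(sentences, idx, min_sents=3, max_sents=5, max_words=20):
    """Return 3 to 5 sentences preceding sentences[idx] as context."""
    ctx = sentences[max(0, idx - min_sents):idx]
    k = min_sents
    # Grow the window only while more sentences exist and the word budget holds.
    while (k < max_sents and idx - k - 1 >= 0
           and sum(len(s.split()) for s in ctx) <= max_words):
        k += 1
        ctx = sentences[idx - k:idx]
    return ctx
```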

C.2.4. GENDER NEUTRAL NAMES
We provide the following list of steps to create the gender neutral names dataset. Please note that for simplicity we used binary genders; genders beyond female and male are left for future work. The steps apply to all languages:
1. Compile a list L_gnn of gender neutral (unisex) names:
• Collect a list of names with gender statistics, such as the percentage of people with the name who identify as female or male.
• Keep the names that are used in approximately equal proportions (unisex), with at least a female or male proportion above 40%.
2. For each gender neutral name ∈ L_gnn, find an English sentence that contains the name and keep the corresponding parallel sentence in Spanish, French, German or Japanese.
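The name filter in step 1 can be sketched as follows. Here female_fraction is a hypothetical mapping from name to the share of bearers recorded as female (with the male share taken as its complement under the binary simplification above), and "approximately equal proportions" is read as both shares exceeding 40%.

```python
def gender_neutral_names(female_fraction, threshold=0.4):
    """Keep names whose female and male shares both exceed the threshold."""
    return [name for name, f in female_fraction.items()
            if f > threshold and (1.0 - f) > threshold]
```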

D. Prompt templates used in experiments
In this section, we discuss the main prompt templates used in experiments. This includes the INTERCPT Translator generalist and specialist templates used to ask questions about ambiguities, along with exemplars to translate into French, Spanish, German or Japanese. It also includes the INTERCPT User generalist and specialist templates used to answer questions given a context. We also provide the prompt templates for the LLMWCXT experiments, where we use context and the same exemplars to translate into French, Spanish, German or Japanese. Please note that we have normalized special characters for simplicity. The German and Japanese templates, as well as the Spanish and French templates with special characters, can be found in our public code and data repository. In the Python methods listed below, en_text is the input query, ctx is the context, question is the question from the Translator model and answer is the answer from the User model.

D.1. INTERCPT Simulated User Prompts
The 8-shot generalist Simulated User prompt template is the same for all languages and is provided in code block listing 1.
def generalist_simulated_user_context(en_text, question, ctx):
    """Generalist Simulated user has access to context and answers the question."""
    ...
    S: abstract
    C: For the international community is not an abstract concept, it consists of us ourselves.
    Q: Is "abstract" to consider theoretically, to extract something, or a summary, or an adjective?
    A: "abstract" is an adjective that modifies the word "concept".
    ...

Listing 1. INTERCPT Generalist Simulated User Prompt Template

D.2. INTERCPT Generalist Prompt Templates for each target language
The 8-shot Spanish generalist Translator prompt template is the same for all test ambiguity data and is provided in code block listing 4.

Figure 2. Translation queries with multiple possible predictions. Correctly solving subproblems around ambiguities with you and it greatly affects the BLEU (Papineni et al., 2002) translation metric.
    templated_input = """[web] Given a Context (C), provide an Answer (A) to the Question (Q):
    ..."""
    return templated_input

Listing 2. INTERCPT Formality Specialist Simulated User Prompt Template

The 8-shot polysemy specialist Simulated User prompt template is the same for all languages and is provided in code block listing 3.

def polysemy_simulated_user_context(en_text, question, ctx):
    """Polysemy simulated user has access to context and answers the question."""
    templated_input = """[web] Given a Context (C), provide an Answer (A) to the Question (Q):
    ..."""

Listing 3. INTERCPT Polysemy Specialist Simulated User Prompt Template

def spanish_generalist_translator_interactive(en_text, question=None, answer=None):
    """Translation model asks questions and uses answers to translate"""
    if answer == None:
        # Ask questions
        instructions = "[web] Given sentence 'S' to translate to Spanish, ask clarifying questions 'Q' to clarify ambiguities or multiple senses:"
    else:
        # Translate given answer
        instructions = "[web] Given answer 'U' to question 'Q', provide the Spanish translation 'A' of sentence 'S'. Provide the best answer:"
    ...
    return templated_input

Listing 4. INTERCPT Spanish Generalist Translator Prompt Template

The 8-shot French generalist Translator prompt template is the same for all test ambiguity data and is provided in code block listing 5.

def french_generalist_translator_interactive(en_text, question=None, answer=None):
    """Translation model asks questions and uses answers to translate"""
    if answer == None:
        # Ask questions
        instructions = "[web] Given sentence 'S' to translate to French, ask clarifying questions 'Q' to clarify ambiguities or multiple senses:"
    else:
        # Translate given answer
        instructions = "[web] Given answer 'U' to question 'Q', provide the French translation 'A' of sentence 'S'. Provide the best answer:"
    templated_input = """
    ...
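To show how these two-phase Translator templates fit into the overall interaction, here is a schematic of one question/answer round followed by the final translation. The callables translator_lm, user_lm and user_template are illustrative stand-ins, not the paper's API; the real system prompts PaLM models at each step.

```python
def interactive_chain(en_text, ctx, translator_lm, user_lm,
                      translator_template, user_template):
    """One clarifying-question round, then the final translation."""
    # Step 1: Translator asks a clarifying question about en_text.
    question = translator_lm(translator_template(en_text))
    # Step 2: Simulated User answers the question using the context.
    answer = user_lm(user_template(en_text, question, ctx))
    # Step 3: Translator translates given the question and answer.
    return translator_lm(translator_template(en_text, question=question, answer=answer))
```

With stub functions standing in for the two LMs, a formality ambiguity would flow through as: ask whether "you" is formal, receive "formal" from the context-holding User, then emit the formal translation.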

Table 1. AMBIGMT data examples for each ambiguity for target language x. ∆_B is the BLEU performance drop from 100 if the highlighted ambiguity is resolved incorrectly.

Table 3. Translation results on unseen ambiguity subproblems using the Gender Neutral Names data and with added unseen domain using the Neutral Professions data. INTERCPT results labelled with † are significantly better with p = 0.05.
Here, we provide key insights on INTERCPT. We show that INTERCPT better generalizes to unseen ambiguities in Sub...

For each prompt template, the plots show the effects of scaling the PaLM parameter count (8B, 62B and 540B) on the performance of formality, "it" resolution and polysemy for Spanish (ES), French (FR), German (DE) and Japanese (JA) target languages. Please note that while we vary the parameter count of the Translator LM, the User LM is a 540B-parameter PaLM model for all experiments.

Table 4. Examples of interaction chain errors.
Figure 6. Error analysis. rez = "it" resolution, Prof. = Neutral Professions, Names = Gender Neutral Names.

(... et al., 2019; Stanovsky et al., 2019; Saunders & Byrne, 2020). Even when there are explicit gender pronouns in the input query or in the context, the text generated by NMT systems tends to be masculine when translated into languages with grammatical gender (Stanovsky et al., 2019; Saunders & Byrne, 2020; Stafanovičs et al., 2020; Wang et al., 2022). To measure gender bias, all generated trans...

Our approach discovers preferences and background knowledge about an input query in the source language and more flexibly adapts translations according to a user's natural language response. The interaction is similar to Conversational AI systems, where user utterances influence generated outputs. Task- or goal-oriented conversational AI systems (Konstantinova & Orasan, 2013; Gao et al., 2018; Hussain et al., 2019) are typically deployed to answer knowledge-based questions, seek information or solve basic queries (e.g. making reservations, purchasing an item). To the best of our knowledge, our work is the first to explore conversational interaction in cross-lingual generation. Resolving ambiguities by asking for clarifications has been a recent topic of research for QA and conversational search systems (Lee et al., 2019; Aliannejadi et al., 2019; Zamani et al., 2020; Dhole, 2020; Wang & Li, 2021; Wu et al., 2022b). Departing from such methods, INTERCPT does not draw from a preset list of questions; its questions are generated by a large LM without constraint. Concurrently to our work, Krasheninnikov et al. (2022) explored finetuning GPT-3 to generate clarifying questions and provide answers using human generated data from AmbigQA (Min et al., 2020) for open-domain QA.
Prompting using Large LMs is a technique that has garnered increasing attention of late. Works on GPT-3 (Brown et al., 2020) and PaLM (Chowdhery et al., 2022) show competitive n-shot BLEU translation results on WMT. The prompt demonstrations are populated with n random sentence pairs. Machine interactivity has assisted translators in writing translations by displaying automated word suggestions that update incrementally (Green et al., 2014; Santy et al., 2019). The approach however is limited by drop-down menu options and requires a certain level of sophistication from the user in the target language.

Table 5. Other MT datasets that contain specific linguistic phenomena and provide context. en = English, de = German, fr = French, ru = Russian, zh = Mandarin Chinese, ja = Japanese.
6. Keep samples if is formal != is informal; use the 'formal' label if is formal, or the 'informal' label if is informal.
7. For each sample, create context by keeping the preceding three to five English sentences, depending on whether the word count is above 20.
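Step 6's keep-and-label rule can be sketched as a small helper; the flag names mirror the checks described in the formality steps, and None standing for a discarded sample is an illustrative convention.

```python
def formality_label(is_formal, is_informal):
    """Keep the sample only when exactly one formality flag is set."""
    if is_formal != is_informal:
        return "formal" if is_formal else "informal"
    return None  # ambiguous or unmarked sample, discarded
```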
• If the English sentence has gendered pronouns, skip the sentence if multiple genders are detected.
• If the English sentence has no gendered pronouns, use a Part-of-Speech tagger on the corresponding parallel sentence in Spanish, French, German or Japanese and skip the sentence if multiple genders are detected.
• Keep the gender label.
3. Replace gendered pronouns with [pr] in the source English sentence to remove simple clues about the name's gender.
4. For each sample, create context by keeping the succeeding three to five English sentences, depending on whether the word count is above 20.
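Step 3's pronoun masking can be sketched with a regular expression. The pronoun inventory here is illustrative rather than the paper's exact list.

```python
import re

# Third-person gendered pronouns to hide; extend as needed.
GENDERED = r"\b(he|him|his|she|her|hers)\b"

def mask_pronouns(sentence):
    """Replace gendered English pronouns with the [pr] placeholder."""
    return re.sub(GENDERED, "[pr]", sentence, flags=re.IGNORECASE)
```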