Interpreting Indirect Answers to Yes-No Questions in Multiple Languages

Yes-no questions expect a yes or no for an answer, but people often skip polar keywords. Instead, they answer with long explanations that must be interpreted. In this paper, we focus on this challenging problem and release new benchmarks in eight languages. We present a distant supervision approach to collect training data. We also demonstrate that direct answers (i.e., with polar keywords) are useful to train models to interpret indirect answers (i.e., without polar keywords). Experimental results demonstrate that monolingual fine-tuning is beneficial if training data can be obtained via distant supervision for the language of interest (5 languages). Additionally, we show that cross-lingual fine-tuning is always beneficial (8 languages).

Yes-no questions are questions that expect a yes or no for an answer. Humans, however, often answer these kinds of questions without using a yes or no keyword. Rather, they provide indirect answers that must be interpreted to reveal the underlying meaning (see examples in Table 1). Indirect answers are used to ask follow-up questions or provide explanations for negative answers (Thompson, 1986), prevent wrong interpretations (Hirschberg, 1985), or show politeness (Brown and Levinson, 1978). This is true at least in English and the eight additional languages we work with. Note that question answering is usually defined as finding an answer to a question given a collection of documents. On the other hand, interpreting indirect answers to yes-no questions is defined as mapping a known answer to its correct interpretation.
Many NLP problems were initially investigated in English (Kann et al., 2019). Even though yes-no questions and indirect answers have been studied for decades (Hockey et al., 1997; Green and Carberry, 1999), previous efforts to date have predominantly focused on English (Section 2). In this paper, we tackle this challenging problem in eight additional languages: Hindi (hi), Korean (ko), Chinese (zh), Bangla (bn), Turkish (tr), Spanish (es), Nepali (ne), and Persian (fa).
This paper focuses on the multilingual interpretation of indirect answers to yes-no questions. Doing so opens the door to several applications. For example, dialogue systems could avoid inconsistencies and contradictions (Nie et al., 2021; Li et al., 2022).
Consider the examples in Table 1. Follow-up turns such as How long have you felt like that?, What else is required for your support?, and I can't wait to see you at the concert (one per example) would be puzzling and probably frustrating to hear.
The main contributions are as follows:
1. A distant supervision approach to collect yes-no questions and direct answers along with their interpretations. We use this approach to collect training data in five languages.
2. Evaluation benchmarks in eight languages in which no resources exist for interpreting indirect answers to yes-no questions.
3. Experimental results showing that training with the direct answers obtained via distant supervision is beneficial to interpret indirect answers in the same language.
4. Experimental results expanding on (3) and showing that multilingual training is beneficial, even when no additional training data for the language of interest is available.

Related Work
Question Answering Researchers have targeted, among others, factual questions (Morales et al., 2016), why questions (Lal et al., 2022), questions in context (Choi et al., 2018), questions over procedural texts (Tandon et al., 2019), and natural questions submitted to a search engine (Kwiatkowski et al., 2019). The problem is defined as finding answers to a given question (often within a set of documents). Unlike this line of work, interpreting indirect answers to yes-no questions is about determining the underlying meaning of answers, not finding them.
Yes-No Questions have also been studied for decades (Hockey et al., 1997; Green and Carberry, 1999). Recent work includes large corpora such as BoolQ (Clark et al., 2019) (16,000 yes-no questions submitted to a search engine) and extensions including unanswerable questions (Sulem et al., 2022).
Yes-no questions have also been studied within the dialogue domain (Choi et al., 2018; Reddy et al., 2019). These dialogues, however, are synthetic and constrained to a handful of topics. Circa (Louis et al., 2020), Friends-QIA (Damgaard et al., 2021), and SWDA-IA (Sanagavarapu et al., 2022) also explore yes-no questions in dialogues (crowdsourced, modified TV scripts, and phone conversations, respectively). We are inspired by their interpretations and use their corpora in our experiments (Section 5). All computational work on yes-no questions to date is in English. We are the first to target eight additional languages. Crucially, we do so by exploring distant supervision and cross-lingual learning; our approach does not require tedious manual annotations.
Previous works have also focused on better fine-tuning approaches for cross-lingual transfer (Joty et al., 2017; Zheng et al., 2021). Others use machine translation to create synthetic multilingual training data or to translate multilingual test benchmarks into English (Artetxe et al., 2023). In addition, several efforts target multilingual question answering. Wu et al. (2022) present a siamese semantic disentanglement model for multilingual machine reading comprehension. Riabi et al. (2021) propose a multilingual QA dataset leveraging question generation models to generate synthetic instances. Other work develops a Spanish SQuAD dataset based on a Translate-Align-Retrieve method (Carrino et al., 2020).
In this paper, we avoid translations. Instead, we use distant supervision. Our approach (a) is conceptually simple, (b) only requires unannotated corpora and rules for yes-no questions and direct answers in the languages of interest, and, importantly, (c) obtains statistically significantly better results than cross-lingual learning from English.

Obtaining Training Data
For training purposes, we work with six languages. Specifically, we work with existing English corpora (Section 3.1) and corpora obtained in five additional languages via distant supervision (Section 3.2). As we shall see, multilingual transfer learning is successful even when no examples in the language of interest are available (Section 5.3).

Existing English Corpora
There are three English corpora that include yes-no questions, indirect answers, and their interpretations: Circa (Louis et al., 2020), SWDA-IA (Sanagavarapu et al., 2022), and Friends-QIA (Damgaard et al., 2021). Table 2 presents statistics. Context refers to the text around the question and indirect answer (i.e., dialogue turns before and after). Circa was created by asking crowdworkers to write 34k yes-no questions that fit 9 scenarios (e.g., friends talking about food) and indirect answers. It does not include context. We note that the frequency of the Middle interpretation is much lower than that of the other two interpretations. SWDA-IA includes 2.5k yes-no questions and indirect answers from SWDA (Stolcke et al., 2000), a telephone conversation dataset. It includes context (three turns before and after). Unlike Circa, questions and answers come from transcriptions of (almost) unconstrained conversations. Friends-QIA includes 5.9k yes-no questions and indirect answers derived from Friends, a TV show. Unlike Circa and SWDA-IA, questions and answers in Friends-QIA were manually modified to facilitate the task. For example, some yes and no keywords in answers were removed (e.g., yes, yeah, yep) but not others (e.g., of course, absolutely). Further, relevant information from the context was added to questions, and interjections (e.g., Hey!) were deleted, among other changes.
The three datasets do not consider the same interpretations (Circa: 8 options, SWDA-IA: 5 options, Friends-QIA: 6 options). We cluster them into Yes, No, and Middle following our definitions (Section 4, Appendix A) for comparison purposes. Because Circa and SWDA-IA are the only corpora with "naturally occurring" questions and answers, we choose not to work with Friends-QIA.

Distant Supervision for New Languages
We follow a distant supervision approach to collect multilingual training data for interpreting indirect answers to yes-no questions. The only requirements are (a) (relatively) large unannotated corpora in the languages of interest and (b) rules to identify yes-no questions, direct answers, and their interpretations in each language. We found that native speakers can write robust rules after a few iterations.

Source Corpora We made an effort to identify relevant corpora for the eight languages we work with but could only do so in five languages. The other three languages (Bangla, Nepali, and Persian) are spoken by millions, and we could certainly find digital texts in these languages. But (a) creating a large collection of dialogues and (b) splitting noisy transcripts into turns and sentences are outside the scope of this paper. Further, doing so would raise copyright and ethical considerations.
For Chinese, we select (a) NaturalConv (Wang et al., 2021), a synthetic dialogue dataset written by crowdworkers covering several topics, and (b) LCCC-base (Wang et al., 2020), posts and replies extracted from Weibo, a Chinese social media platform. For Spanish, we choose CallFriend (Canavan and Zipperlen, 1996), a corpus consisting of unscripted telephone conversations. For Hindi, we collect questions and answers from Twitter using their API. For Korean, we select a question-answering corpus from AI Hub, which includes civil complaints and replies from public organizations. For Turkish, we identify FAQ (i.e., Frequently Asked Questions) and CQA (i.e., Community Question Answering) datasets from MFAQ (De Bruyn et al., 2021). All of these corpora are used in accordance with their licenses.
These corpora are diverse not only in terms of language. Indeed, they include different genres (written forum discussions, dialogue transcripts, question-answer pairs, etc.) and domains (social media, informal conversations, etc.).

Identifying Yes-No Questions We define rules based on lexical matching to identify yes-no questions from the aforementioned corpora in each language. The Hindi rules are as follows (the Hindi keywords are given by their English glosses). A tweet contains a yes-no question if it:
• contains a question mark and any of these bigrams (glossed from Hindi): do you, do we, will this, does this (ever), can this happen;
• does not contain these words (glossed from Hindi): where, why, how, who, whose, which, or, when;
• has between 3 and 100 tokens; and
• does not (a) contain links, @mentions, #hashtags, or numbers or (b) come from unverified users, retweets, or replies to tweets.

Identifying Direct Answers The next set of rules identifies which yes-no questions are followed by a direct answer. We have rules that identify direct answers and their interpretations based on yes and no keywords. We complement these rules with rules to discard answers that cannot be reliably interpreted regardless of keywords. The rules for Hindi are as follows. A reply tweet to a yes-no question is a direct answer if it:
• contains yes keywords (glossed from Hindi: yes, sure, correct, definitely) or the English keywords yes, yeah, sure, of course, 100%; or no keywords (glossed from Hindi: no, don't, not) or the English keywords no, never, n't (we include a few English keywords since people code-switch between these languages);
• does not contain links, #hashtags, or more than one @mention (i.e., it only replies to the user who asked the yes-no question);
• does not contain question marks; and
• has between 6 and 30 tokens.
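The sketch below illustrates the shape of these rules in Python. It is not the exact code used in the paper: the romanized keyword tuples are illustrative placeholders for the Hindi keywords glossed above, and the metadata filters (verified user, not a retweet or reply) are omitted for brevity.

```python
import re

# Placeholder keyword lists; the actual rules use Hindi (Devanagari) keywords.
QUESTION_BIGRAMS = ("kya aap", "kya hum", "kya yah")        # do you / do we / does this
EXCLUDED_WORDS = ("kahan", "kyon", "kaise", "kaun", "kab")  # where / why / how / who / when

def is_yes_no_question(tweet: str) -> bool:
    tokens = tweet.split()
    if not 3 <= len(tokens) <= 100:          # length constraint
        return False
    if "?" not in tweet:                     # must contain a question mark
        return False
    if re.search(r"https?://|@\w+|#\w+|\d", tweet):  # links, mentions, tags, numbers
        return False
    if any(word in tokens for word in EXCLUDED_WORDS):
        return False
    return any(bigram in tweet for bigram in QUESTION_BIGRAMS)
```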

Appendix B details the rules for other languages.
Analysis The result of the rules for distant supervision consists of yes-no questions, direct answers, and (noisy) keyword-based interpretations. We prioritize precision over recall, as we need as little noise as possible for training purposes. We estimate quality with a sample of 200 instances per language (Table 3). The rules to identify yes-no questions are almost perfect in four out of five languages (Precision: 0.98-1.00). We identify thousands of yes-no questions in all the languages, and many of those (35.2%-65.3%) are followed by a direct answer that we can interpret with our rules. The ratio of yes and no interpretations varies across languages, and the precision of the rules to interpret direct answers is high (0.93-0.97) in all languages except Hindi (0.67). We believe that (a) the ratios of yes and no depend on the domain of the source corpora and (b) the lower precision in Hindi is due to the fact that we work with Twitter as opposed to more formal texts. Note that all the question-answer pairs identified via distant supervision are interpreted as yes or no. Regardless of the quality of the data, whether it is useful in the training process is an empirical question (Section 5).

Benchmarks in New Languages
We are the first to work on interpreting answers to yes-no questions in languages other than English (Hindi, Korean, Chinese, Bangla, Turkish, Spanish, Nepali, and Persian). We set out to work with questions and answers written in the languages of interest to avoid translationese (Koppel and Ordan, 2011), so we create new benchmarks with 300 question-answer pairs per source dataset (Chinese and Turkish, which draw on two source datasets each: 600 samples; other languages: 300 samples). For the five languages for which there are large unannotated corpora and we have built rules (Section 3), we select yes-no question-answer pairs without a direct answer (i.e., those identified by our rules as yes-no questions but not followed by a direct answer). We collect 300 question-answer pairs for the other languages from Bangla2B+ (Bhattacharjee et al., 2022, Bangla), Kantipur Daily (https://ekantipur.com/, Nepali), and LDC2019T11 (Mohammadi, 2019, Persian).

Annotation Guidelines We manually annotate the interpretations of indirect answers to yes-no questions in the eight languages using three labels:
• Yes: the answer leans towards yes, including probably yes, yes under some conditions, and strong affirmative answers (e.g., Absolutely!).
• No: the answer leans towards no.
• Middle: Yes or No interpretations do not apply.
Our interpretations are a coarser version of those used in previous work. Appendix A presents a mapping, and Appendix C details the guidelines.
Table 4 presents examples. In the Hindi (hi) example, the yes interpretation relies on the negative sentiment of the answer. In the Turkish (tr) example, the yes interpretation requires commonsense knowledge about fasting, which includes not drinking water. Similarly, the Persian (fa) example requires commonsense (seeing a mountain implies clear weather). In the Korean (ko) example, the answer provides an alternative, thereby rejecting the request. In the Spanish (es) example, the answer affirms the inquiry without using yes or similar keywords. In the Nepali (ne) example, the answer is interpreted as middle since it mentions laws rather than discussing an opinion. In the Chinese (zh) example, the answer suggests using a can, implying that they do not have any water bottles in the car. In the Bangla (bn) example, the answer confirms that the questioner has a fever (despite it being lower).

Inter-Annotator Agreements Native speakers in each language performed the annotations. We were able to recruit at least two annotators for five languages (zh, hi, es, tr, and bn) and one annotator for the rest (ne, ko, fa). Inter-annotator agreements (linearly weighted Cohen's κ) are as follows: Turkish: 0.87, Hindi: 0.82, Spanish: 0.76, Bangla: 0.73, and Chinese: 0.67. These coefficients are considered substantial (0.6-0.8) or (nearly) perfect (>0.8) (Artstein and Poesio, 2008). We refer the reader to Appendix D for a detailed Data Statement.

Dataset Statistics and Label Frequency Table 5 presents the statistics of our training datasets and benchmarks in the new languages. Since we obtain the training datasets via distant supervision (Table 3), they only include yes-no questions with direct answers, and thus their interpretations are limited to yes or no. For the benchmarks, we select instances from each language randomly. This allows us to work with the real distribution of interpretations instead of artificially making it uniform. As shown in Table 5 (bottom), yes is always the most frequent label (34.3%-54.2%), but the label distributions vary across languages.
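For reference, the linearly weighted Cohen's κ reported above can be computed with scikit-learn; the annotations below are hypothetical, and the explicit integer encoding of the labels (so that no < middle < yes is the ordering the linear weighting operates on) is our assumption:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical annotations from two annotators over five instances.
annotator_1 = ["yes", "no", "middle", "yes", "no"]
annotator_2 = ["yes", "no", "yes", "yes", "middle"]

# Encode labels as ordered integers so linear weighting penalizes
# yes/no disagreements more than yes/middle or no/middle ones.
order = {"no": 0, "middle": 1, "yes": 2}
kappa = cohen_kappa_score(
    [order[label] for label in annotator_1],
    [order[label] for label in annotator_2],
    weights="linear",  # linearly weighted Cohen's kappa
)
print(round(kappa, 2))
```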

Experiments
We develop multilingual models to reveal the underlying interpretations of indirect answers to yes-no questions. We follow three settings: (a) cross-lingual learning to the new language (after training with English, Section 5.1); (b) monolingual fine-tuning via distant supervision with the language of interest (Section 5.2); and (c) multilingual fine-tuning via distant supervision with many languages (Section 5.3). Following previous work (Section 2), we use the question and answer as inputs. We conduct our experiments with two multilingual transformers: XLM-RoBERTa (Conneau et al., 2020) and XLM-Align (Chi et al., 2021). Both are obtained from HuggingFace (Wolf et al., 2020) and were pretrained on hundreds of languages. We split our benchmarks into validation (20%) and test (80%), and report results with the test split. Cross-lingual learning was conducted with a model trained and validated exclusively in English (with Circa and SWDA-IA). Monolingual and multilingual fine-tuning use the validation split in each language for hyperparameter tuning; the test set is always the same. We will discuss results with XLM-RoBERTa, as it outperforms XLM-Align by a small margin. Appendices F and G detail the results with XLM-Align and the hyperparameters.
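A minimal sketch of this setup follows; the checkpoint name, the label order of the 3-way head (yes / middle / no), and the example inputs are our assumptions rather than the exact configuration (see Appendix G for the tuned hyperparameters):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Question and answer are fed as a sentence pair to a multilingual
# transformer with a 3-way classification head.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=3
)

inputs = tokenizer(
    "Are you coming to the concert?",    # yes-no question
    "I have an exam the next morning.",  # indirect answer
    truncation=True,
    return_tensors="pt",
)
logits = model(**inputs).logits  # shape: (1, 3), one score per interpretation
```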

Cross-Lingual Learning from English
We start the experiments with the simplest setting: training with existing English corpora (Circa and SWDA-IA) and evaluating with our benchmarks. We consider this a cross-lingual learning baseline since the model was neither fine-tuned nor validated with the new languages.
Table 6 presents the results. Training with Circa is always better than (or equal to) SWDA-IA, but combining both yields the best results with most languages. The only exception is Hindi, where training with Circa alone obtains better results. This is intuitive considering that Circa is a much larger dataset than SWDA-IA (Table 2). The lower results with SWDA-IA for Hindi are likely due to the mismatch of domains (SWDA-IA: phone conversation transcripts; Hindi: Twitter).

Fine-Tuning with the New Language
We continue the experiments by training models with English corpora (gold annotations, Circa and SWDA-IA) and the data obtained via distant supervision for the language of interest (noisy data, Section 3.2 and Table 5). We adopt the fine-tuning methodology by Shnarch et al. (2018) to blend training data from the English corpora and the additional instances obtained via distant supervision. Briefly, we start the training process (first epoch) with the concatenation of the English data and the data for the new language. Then, we reduce the data for the new language by a ratio α after each epoch. We choose the English corpora based on the best combination from Table 6 for each language. We found that reducing the data for the new language (as opposed to the English data) yields better results. We believe this is because the former (a) is noisier (distant supervision vs. human annotations) and (b) does not include any middle interpretations. Table 7 presents the results. Recall that additional data is only available in five languages. The results show that blended training with English and the new language is always beneficial. The improvements (%∆Fen) range from 2% to 19%. They are the largest with the languages for which we have the most additional data (Table 3): Turkish (9%) and Chinese (19%). Note that although the additional Hindi data is noisy (0.67 precision, Table 3), we still observe improvements (2%).
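A minimal sketch of this blending schedule (after Shnarch et al. (2018)) follows; the function and variable names are ours, and the real training loop is of course more involved:

```python
import random

def blended_epochs(english_data, new_lang_data, alpha, num_epochs):
    """Yield the training set for each epoch: all of the English data plus
    a fraction of the new-language data that shrinks by alpha per epoch."""
    ratio = 1.0
    for _ in range(num_epochs):
        k = int(len(new_lang_data) * ratio)
        yield english_data + random.sample(new_lang_data, k)
        ratio *= alpha  # decay only the (noisier) new-language data

# With alpha = 0.5: epoch 1 uses all new-language instances,
# epoch 2 uses half of them, epoch 3 a quarter, and so on.
```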

Fine-Tuning with Several Languages
We close the experiments by exploring whether it is beneficial to train with languages other than English and the language of interest. In other words, we answer the following question: does multilingual fine-tuning yield better results?
We use the same blending strategy as in monolingual fine-tuning (Section 5.2) but adopt a greedy approach to find the best combination of additional languages to train with. We start with the model trained with English and the data for the language of interest if available (Table 7). Otherwise, we start with the model trained with English (Table 6). Then, we add data from an additional language (one at a time) and select the best based on the results with the validation split. We continue this process until including additional languages does not yield higher results with the validation split.
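In pseudocode terms, the greedy search can be sketched as follows; fine_tune and validate are placeholders for the blended training and the validation-split evaluation described above, not functions from our codebase:

```python
def greedy_language_selection(base_model, candidates, fine_tune, validate):
    """Greedily add one language at a time while validation F1 improves.

    fine_tune(model, lang) -> model and validate(model) -> F1 stand in for
    the blended training (Section 5.2) and validation-split evaluation.
    """
    best_model, best_score = base_model, validate(base_model)
    selected = []
    while candidates:
        # Try each remaining language and keep the best on validation.
        scored = []
        for lang in candidates:
            model = fine_tune(best_model, lang)
            scored.append((validate(model), lang, model))
        score, lang, model = max(scored, key=lambda t: t[0])
        if score <= best_score:
            break  # no additional language improves validation results
        best_model, best_score = model, score
        selected.append(lang)
        candidates.remove(lang)
    return best_model, selected
```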
Table 8 presents the results. The technique is successful. Compared to monolingual fine-tuning (Table 7), multilingual fine-tuning is always beneficial (%∆F = 2-28). More importantly, the improvements with respect to cross-lingual learning (Table 6) are even higher across all languages (%∆Fen = 2-53). For five out of eight languages, the improvements are at least 11% and statistically significant (McNemar's test, p < 0.05). Note that multilingual fine-tuning obtains significantly better results with two of the languages for which we could not find source corpora to use distant supervision: Nepali (17%) and Bangla (22%). We hypothesize that the low gains with Persian might be due to the fact that it is the only one using a right-to-left script.

Which languages are worth training with? Some of the language combinations that are worth blending with are surprising (Table 8). For example, Chinese (zh) is beneficial for Hindi (hi), Turkish (tr), and Spanish (es). This may be due to the fact that Chinese is one of the languages for which we collect the most data via distant supervision. Similarly, Turkish (tr) is useful for Korean (ko).

Figure 1: Language similarities between the eight languages we evaluate with (boldfaced) and each of the languages we use distant supervision with (hi, ko, zh, tr, and es). Underlining indicates that a language is worth blending with (Table 8). The more similar the languages, the more likely to be useful, although there are exceptions (e.g., Korean and either Turkish or Hindi).
In order to analyze these surprising results, we sort all language pairs by their similarity (Figure 1). We define language similarity as the cosine similarity between their lang2vec vectors (Littell et al., 2017); more specifically, we use syntactic and language family features from the URIEL typological database. Generally speaking, the more similar a language, the more likely it is to be useful for multilingual fine-tuning. There are only a few exceptions (e.g., Korean with Hindi and Turkish).
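A sketch of this similarity computation with the lang2vec package follows; the feature-set string ("syntax_knn+fam") is our assumption for how to combine the syntactic and language-family URIEL features, not necessarily the exact configuration used for Figure 1:

```python
import numpy as np
import lang2vec.lang2vec as l2v

def lang_similarity(a: str, b: str) -> float:
    """Cosine similarity between URIEL syntax + family vectors.
    Languages are given as ISO 639-3 codes (e.g., 'kor', 'tur')."""
    feats = l2v.get_features([a, b], "syntax_knn+fam")
    va = np.asarray(feats[a], dtype=float)
    vb = np.asarray(feats[b], dtype=float)
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))

print(lang_similarity("kor", "tur"))  # e.g., Korean vs. Turkish
```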

Examples of Errors
Table 9 presents questions and answers for which our best model fails to identify the correct interpretation. In the Chinese (zh) example, the no interpretation could be obtained by contrasting urban area and airport. However, the model lacks such commonsense knowledge. Similarly, in the Bangla (bn) example, a wife being afraid of dogs most likely indicates that there are no dogs in the household. In the Korean (ko) example, the answer provides both yes and no for different days of the weekend, but the model appears to consider only Sunday. In the Spanish (es) example, the question has a negation, thus the answer ought to be interpreted as yes. Answers in both the Turkish (tr) and Hindi (hi) examples include a polar distractor that confuses the model (tr: "no longer", yet the author has a favorite game; hi: "stop", yet the answer ought to be interpreted as nobody can stop paying taxes). The answer in Nepali (ne) indicates it could (or could not) happen, while the model only takes it as an affirmation. The answer in Persian (fa) is a rhetorical question with negation, thus the correct interpretation is yes.

A Note on Large Language Models
Large Language Models (LLMs) obtain impressive results in many tasks (Mishra et al., 2022). Researchers have also shown that LLMs can solve problems even with malicious prompts that no human may understand (Webson and Pavlick, 2022). LLMs, however, do not outperform our best model (Table 8). First, note that open-source LLMs are "pretrained on Latin or Cyrillic scripts" (Touvron et al., 2023), thus they cannot process most of the languages we work with. Second, closed-source LLMs such as ChatGPT are likely to have been pretrained with the data in our benchmarks, so data leakage is an issue. We randomly selected 30 instances for each language from our benchmarks and fed them to ChatGPT via the online interface using the prompt in Appendix H. The best model for each language (Table 8) outperforms ChatGPT by 16.4% on average with these instances (F1: 54.25 for ChatGPT vs. 63.13 for our best model). Regardless of possible data leakage, we acknowledge that ChatGPT obtains impressive results in a zero-shot setting.

Table 9: Examples of errors made by our best model (Table 8) on instances from our benchmarks (predicted → gold interpretation; text in other scripts is glossed in English):
• [es] Q: And you haven't had surgery yet? A: Well, I've been here for the 2nd time. (no → yes)
• [tr] Q: Do you have a favorite computer or mobile game? A: I am no longer addicted to PC games, but there is Dead by Daylight, which I have been playing occasionally for 3 years. (no → yes)
• [ne] Q: Is there a discussion that after going to the government, the case can be dropped? A: That could be. (yes → middle)
• [fa] Q: Does anyone do this? A: Don't they? (no → yes)
• [hi] Q: Should we stop paying taxes? A: Even if you stop breathing, you will still have to pay taxes. (yes → no)
• [bn] Q: Sir, do you have a dog in your house? A: My wife is afraid of dogs. (yes → no)

Conclusions
We have tackled for the first time the problem of interpreting indirect answers to yes-no questions in languages other than English. These kinds of answers do not include yes or no keywords. Their interpretations, however, are often either Yes or No. Indeed, Middle accounts for between 12.7% and 33.7% of the benchmark instances depending on the language (Table 5).
We have created new evaluation benchmarks in eight languages: Hindi, Korean, Chinese, Bangla, Turkish, Spanish, Nepali, and Persian. Additionally, we present a distant supervision approach to identify yes-no questions and direct answers. Training with this new, noisy, language-specific data is always beneficial, and significantly better in five languages. Cross-lingual learning from many languages to a new language is the best strategy. This is true even for the three languages for which distant supervision was not explored because of difficulties finding large unannotated corpora (Bangla, Nepali, and Persian). Our approach successfully learns to interpret indirect answers from direct answers obtained via distant supervision.
Our future work includes exploring more robust multilingual fine-tuning (Zheng et al., 2021). We also plan to explore applications of the fundamental research presented here. In particular, we are interested in improving dialogue consistency by ensuring that generated turns are compatible with the correct interpretation of indirect answers to yes-no questions.

Limitations
We double-annotate the benchmarks in five out of the eight languages. We are not able to do so in the other three languages (Bangla, Nepali, and Persian) since we are unable to recruit a second native speaker. While great care was taken, including single annotators double- and triple-checking their work, we acknowledge that not reporting inter-annotator agreements in these three languages is a weakness.
We adopt three labels (Yes, No, and Middle) to represent the interpretations of answers to yes-no questions. Some previous works use finer-grained label sets, as we discussed in Section 3.1. Considering that (a) there is no universal agreement about the possible ways to interpret answers to yes-no questions and (b) interpretations of answers other than the three we adopted are rare (less than 10% in Circa and Friends-QIA, and less than 20% in SWDA-IA), we argue that three labels are sound.
We run the experiments with two models: XLM-RoBERTa and XLM-Align. We acknowledge that there are other transformers such as InfoXLM and XLM-E. However, they either obtain similar (or worse) results on existing benchmarks, or are not open-source at the time of writing. Regarding open-source Large Language Models (LLMs) such as LLaMA and Alpaca, we do not run experiments with them since (a) fine-tuning even the smallest version (7B parameters) requires significant computing resources, and (b) they are pretrained only on Latin or Cyrillic scripts and thus may have limitations with other languages.
Our in-context learning experiments are conducted with ChatGPT based on GPT-3.5 (text-davinci-002), which was the state-of-the-art model at the time we conducted the experiments. GPT-4 is already out, and better language models will continue to come out. We acknowledge that testing with GPT-4 may obtain better results, but we argue doing so is not necessary: (a) querying the GPT-4-based ChatGPT interface is limited to a few samples per hour, and (b) OpenAI has never announced the list of supported languages, so we would be shooting in the dark. A more serious issue is the fact that GPT-X may have been trained with the data we work with, so any results are to be taken with a grain of salt.
We tune several hyperparameters (including the blending factor α, listed in Table 12 and Table 13) with the train and development splits, and report results with the test set. The results are taken from the output of one run. We acknowledge that averaging multiple runs (e.g., 10) would be more reliable, but doing so also requires much more computation (literally, 10 times more).

Ethical Considerations
Data sources and collection. Our benchmarks use source texts in accordance with their licenses. We collect corpora in Spanish and Persian from the Linguistic Data Consortium under a non-commercial license. Samples in Turkish from MFAQ are used under the Apache-2.0 license. Samples in Chinese from NaturalConv and LCCC are used under a non-commercial license and the MIT license, respectively. Samples in Hindi from Twitter are used under the Twitter developer policy. Samples in Korean from AI Hub are used under a non-commercial license. Samples in Bangla from Bangla2B+ are used under the CC BY-NC-SA 4.0 license. Samples in Nepali from Kantipur Daily are used under a non-commercial license.
B Rules for Other Languages

Chinese For the Chinese corpora (NaturalConv and LCCC-base), we define the following rule to identify yes-no questions: a conversation turn ends with the modal particle 吗 or 嘛, followed by a question mark ('？'). We also define rules to identify direct answers in the turn following yes-no questions. Our rules are defined as follows. The answer turn (i.e., the turn after the question turn) starts with:
• yes keywords (对 (right), 好 (okay), 嗯 (uh-huh), 恩 (uh-huh), 当然 (of course), 必须 (must)) or no keywords (不 (no) and 没 (absence)); or
• the first verb in the question, either in the affirmative or negative form. For example, in the yes-no question 我可以坐这里吗？ (Can I sit here?), the first verb is 可以 (can). This rule matches turns that start with the verb 可以 (can) or its negated form 不可以 (cannot).
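A minimal sketch of the Chinese yes-no question rule (the direct-answer rules, which require extracting the question's first verb, are not shown):

```python
import re

# The turn must end with the modal particle 吗 or 嘛 followed by a
# full-width question mark.
YES_NO_ZH = re.compile(r"[吗嘛]？$")

def is_yes_no_question_zh(turn: str) -> bool:
    return bool(YES_NO_ZH.search(turn.strip()))

assert is_yes_no_question_zh("我可以坐这里吗？")      # Can I sit here?
assert not is_yes_no_question_zh("你在哪里？")        # Where are you? (wh-question)
```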
Korean For the Korean corpus, we define the following rules to identify yes-no questions. Recall that the corpus contains questions and answers, so the rules are designed to differentiate between yes-no questions and other questions rather than to identify yes-no questions in any corpus. A turn contains a yes-no question if it does not end in modal particle words or phrases that indicate statements instead of questions, such as turns ending in 구요 (e.g., a turn glossed as I was calling if I can file a civic complaint about reckless driving). The rules to identify direct answers in Korean rely on keywords:
• yes keywords: 네 (yes), 예 (yeah), 그렇 (right), and 맞아 (correct);
• no keywords: 아 없 (ah no), 예 없 (yeah, no), 안타깝 (unfortunately), 아니 (no), 아뇨 (no), and 아닙 (no).

C Annotation Guidelines for the Benchmarks
We conduct manual annotations to obtain ground-truth interpretations (i.e., gold labels) for indirect answers. We work with three labels: Yes, No, and Middle (Unknown). In order to minimize inconsistencies, we define the labels as follows:
• Yes: The answer leans towards (or implies) yes, or yes under certain conditions or constraints. The latter could be interpreted as probably yes (e.g., Q: Ever done this before? A: Once.).

D Data Statement
As recommended by Bender and Friedman (2018), we provide a data statement to better understand the new data presented in this paper.

Curation Rationale
We develop new datasets to help interpret indirect answers to yes-no questions in eight languages.
First, we adopt a rule-based approach to collect yes-no questions, direct answers, and interpretations of the answers in five languages. The interpretations were obtained by automatically mapping the positive keyword to Yes and the negative keyword to No. This data is noisy (see the quality estimation in Table 3; Precision = 0.93-0.97, except in Hindi (0.67)) and intended to be used for distant supervision. Second, we collect yes-no questions with indirect answers and manually annotate their interpretations in the eight languages we work with. For the five languages we used distant supervision with, we collect questions followed by answers not identified by our rules (i.e., without polar keywords). For the other three languages, we collect questions and indirect answers from scratch.
The Spanish and Chinese corpora come with context (i.e., the turn before the question and after the answer) since they were obtained from conversations. The corpora in the other languages include only the questions and answers.
We list all the original data sources for our data in Section 3.

Language Variety
We work with eight languages to develop our datasets. The language list is as follows: Bangla as spoken in Bangladesh (bn-BD), Chinese as spoken in Mainland China and written with simplified characters (zh-CN), Korean as spoken in South Korea (ko-KR), Turkish as spoken in Turkey (tr-TR), Spanish as spoken in the United States, Canada, Puerto Rico, or the Dominican Republic (es-US, es-CA, es-PR, es-DO), Persian as spoken in Iran (fa-IR), and Nepali as spoken in Nepal (ne-NP). We are not able to provide language variety information for Hindi since the data comes from Twitter.

Speaker Demographic
Our corpora are collected from various sources. We are not able to provide speaker demographics since such information is absent in the original sources. However, all the corpora are spoken (or written) by native speakers.

Annotator Demographic
We recruit 12 annotators: two women and ten men. Their ages range from 18 to 30 years old. Among them, three are native speakers of Chinese, two of Spanish, two of Bangla, one of Persian (and proficient in Turkish), one of Nepali (and proficient in Hindi), one of Hindi, one of Turkish, and one of Korean. All of them are highly proficient in English. Ethnic backgrounds are as follows: four individuals are from East Asia, four from South Asia, one from the Middle East, one from Europe, one from North America, and one from South America. All annotators reported a middle-class socioeconomic background.
Educational backgrounds are as follows: two are undergraduate students, nine are graduate students, and one holds a doctoral degree. Nine of them work in NLP-related research areas, two work in other research areas within Computer Science, and one has a Mathematics background.

Speech Situation
The Chinese corpora are from two sources: (a) dialogues written by crowdworkers who are given a specific topic, and (b) posts written by users on social media platforms. The Spanish corpus consists of transcripts of telephone conversations between humans. The Nepali corpus is written by journalists from a daily news website. The Korean corpus is written by civilians and covers complaints and replies from public organizations. The Hindi corpus is written by Twitter users. The Turkish corpus is written by humans and consists of frequently asked questions and community questions. The Bangla corpus comes from a dataset crawled from online sources. The Persian corpus consists of transcriptions of spoken language by humans having informal conversations, including telephone calls and face-to-face interactions.

E Detailed Results for Cross-Lingual Transferring to a New Language

We provide supplemental results in Table 10 that further include Precision (P), Recall (R), and F1-score (F) for our cross-lingual learning experiments. This table complements Table 6.

F Experimental Results with XLM-Align
We present the experimental results obtained with XLM-Align in Table 11. The results are listed in three blocks, each comparable to Table 6, Table 7, and Table 8, respectively. We observe a similar trend in the results with both models, while XLM-RoBERTa outperforms XLM-Align in most cases.

H Details of In-context Learning with ChatGPT
We test 30 random samples for each language with ChatGPT (text-davinci-002) and our best model, as explained in Section 5.5. Figure 2 presents the prompt. In order to give ChatGPT the most credit possible, we manually map the generated answers from ChatGPT into Yes, No, and Middle.

Table 2: Statistics of existing English corpora with yes-no questions, indirect answers, and their interpretations.
Table 3: Analysis of the yes-no questions and direct answers obtained via distant supervision. The precision of the interpretations is lower in Hindi, most likely because we work with Hindi tweets. We use the interpretations of direct answers obtained via distant supervision in these five languages to automatically interpret indirect answers in eight languages.

Table 4: Examples from the benchmarks we create in eight languages. Answers whose interpretations lean towards yes or no are annotated as such; middle is used for 50/50 splits and unresponsive answers.

Table 5: Number of instances and label frequency in the datasets obtained via distant supervision (top, five languages) and the benchmarks for new languages (bottom, eight languages). Yes interpretations are almost always the most frequent in both the training datasets and the benchmarks. Middle is the least frequent label in five of the benchmark languages.

Table 6: Results (F1) obtained with XLM-RoBERTa, a multilingual transformer, trained with English corpora (i.e., cross-lingual in the new language). Appendix E presents additional results (P and R). Training with Circa and SWDA-IA combined is more beneficial in most cases. Training with SWDA-IA alone, however, shows limited improvements.

Table 7: Results obtained in multiple languages with XLM-RoBERTa, a multilingual transformer, trained blending with English corpora (best combination from Table 6) and the instances obtained with distant supervision in the language we evaluate with. %∆Fen indicates the improvement compared to training with English only (Table 6). Training with the new language is always beneficial, and it only requires defining a handful of rules.

Table 8: Results obtained with XLM-RoBERTa, a multilingual transformer, trained blending with English corpora (best combination from Table 6) and the instances obtained with distant supervision in other languages (Column 2). We show only the best combination of languages. %∆Fen and %∆F indicate the improvements compared to training with English only (best in Table 6) and blending English and the language we evaluate with (Table 7). An asterisk indicates that the improvements are statistically significant (McNemar's test (McNemar, 1947), p < 0.05).

Table 12: Tuned hyperparameters for our experiments with the XLM-RoBERTa-base model.

Table 13: Tuned blending factor α for our experiments with the XLM-RoBERTa-base model.

Figure 2: The prompt used with ChatGPT: "I need you to help me understand the underlying meanings of indirect answers to yes-no questions. You can choose from Yes, No, or Middle. You should reply to me with Yes, No or Middle based on your interpretation of the answer."