Interpreting Answers to Yes-No Questions in User-Generated Content

Interpreting answers to yes-no questions in social media is difficult. Yes and no keywords are uncommon, and the few answers that include them rarely mean what the keywords suggest. In this paper, we present a new corpus of 4,442 yes-no question-answer pairs from Twitter. We discuss linguistic characteristics of answers whose interpretation is yes or no, as well as answers whose interpretation is unknown. We show that large language models are far from solving this problem, even after fine-tuning and blending other corpora for the same problem but outside social media.


Introduction
Social media has become a global town hall where people discuss whatever they care about in real time. Despite its challenges and potential misuse, the language of social media can be used to approach many problems, including mental health (Harrigian et al., 2020), rumor detection (Ma and Gao, 2020), and fake news (Mehta et al., 2022).
There is extensive literature on question answering, but few previous efforts work with social media. In particular, yes-no questions are unexplored in social media. Figuring out the correct interpretation of answers to yes-no questions (yes, no, or somewhere in the middle) from formal texts (Clark et al., 2019), short texts (Louis et al., 2020), and conversations (Choi et al., 2018; Reddy et al., 2019) is challenging. This is the case in any domain, but especially in social media, where informal language and irony are common. Additionally, unlike previous work on yes-no questions outside social media, we find that yes and no keywords are not only rare in answers, but also poor indicators of the correct interpretation. Consider the Twitter thread in Figure 1. The question is mundane: whether people preheat the oven before using it. The answers, however, are complicated to decipher. The first answer states that it is very strange that somebody else (the author's girlfriend) does not preheat the oven. Note that the correct interpretation is yes (or probably yes) despite the negation. The second answer explains why it is unnecessary to preheat the oven without explicitly saying so, thus the correct interpretation is no. The third answer includes both yes and no, yet the correct interpretation is unknown, as the author does not commit to either option.
In this paper, we tackle the problem of interpreting natural answers to yes-no questions posted on Twitter by real users, as exemplified above. The main contributions are: (a) Twitter-YN, a new corpus of 4,442 yes-no questions and answers from Twitter along with their interpretations; (b) an analysis of the language used in the answers, in particular showing that keyword presence in answers is a bad predictor of their interpretations; (c) experimental results showing that the problem is challenging, even for large language models such as GPT3, and that blending with related corpora is useful; and (d) a qualitative analysis of the most common errors made by our best model.
Motivation Traditionally, question answering assumes that there are (a) correct answers to questions or (b) no answers (Kwiatkowski et al., 2019). In the work presented in this paper, answers are available; the problem is to interpret what they mean. As Figure 1 shows, answers capture personal preferences rather than correct answers supported by commonsense or common practice (e.g., oven manuals state that preheating is recommended). Interpreting answers to yes-no questions opens the door to multiple applications. For example, dialogue systems for social media (Sheng et al., 2021) must understand answers to yes-no questions to avoid inconsistencies (e.g., praising the author of the second reply in Figure 1 for preheating the oven is nonsensical). Aggregating interpretations could also assist in user polling (Lampos et al., 2013). As the popularity of chat-based customer support grows (Cui et al., 2017), automatic interpretation of answers to yes-no questions could enhance the customer experience. Lastly, interactive question answering and search, in which systems ask clarification questions to better address users' information needs, would benefit from this work (Li et al., 2017). Clarification questions are often yes-no questions, and as we shall see, people rarely use yes, no, or similar keywords in their answers.
Rather, people answer with intricate justifications that are challenging to interpret.

Previous Work
Question Answering Outside Social Media has a long tradition. There are several corpora, including some that require commonsense (Talmor et al., 2019) or understanding tables (Cheng et al., 2022), simulate open-book exams (Mihaylov et al., 2018), and include questions submitted by "real" users (Yang et al., 2015; Kwiatkowski et al., 2019). Models include those specialized to infer answers from large documents (Liu et al., 2020), data augmentation (Khashabi et al., 2020), multilingual models (Yasunaga et al., 2021), and efforts to build multilingual models (Lewis et al., 2020). None of them target social media or yes-no questions.
Yes-No Questions and interpreting their answers have been studied for decades. Several works target exclusive answers not containing yes or no (Green and Carberry, 1999; Hockey et al., 1997). Rossen-Knill et al. (1997) work with yes-no questions in the context of navigating maps (Carletta et al., 1997) and find that answers correlate with actions taken. More recent works (de Marneffe et al., 2009) target questions from SWDA (Stolcke et al., 2000). Clark et al. (2019) present yes-no questions submitted to a search engine, and Louis et al. (2020) present a corpus of crowdsourced yes-no questions and answers. Several efforts work with yes-no questions in dialogues, a domain in which they are common. Annotation efforts include crowdsourced dialogues (Reddy et al., 2019; Choi et al., 2018) and transcripts of phone conversations (Sanagavarapu et al., 2022) and Friends (Damgaard et al., 2021).
Question Answering in the Social Media Domain is mostly unexplored. TweetQA is the only corpus available (Xiong et al., 2019). It consists of over 13,000 questions and answers about a tweet. Both the questions and answers were written by crowdworkers, and the creation process ensured that it does not include any yes-no question. In this paper, we target yes-no questions in social media for the first time. We do not tackle the problem of extracting answers to questions. Rather, we tackle the problem of interpreting answers to yes-no questions, where both questions and answers are posted by genuine social media users rather than paid crowdworkers.

Twitter-YN: A Collection of Yes-No Questions and Answers from Twitter

Since previous research has not targeted yes-no questions in social media, our first step is to create a new corpus, Twitter-YN. Retrieving yes-no questions and their answers takes into account the intricacies of Twitter, while the annotation guidelines are adapted from previous work.

Retrieving Yes-No Questions and Answers
We define a battery of rules to identify yes-no questions. These rules were defined after exploring existing corpora with yes-no questions (Section 2), but we adapted them to Twitter. Our desideratum is to collect a variety of yes-no questions, so we prioritize precision over recall. This way, the annotation process is centered around interpreting answers rather than selecting yes-no questions from the Twitter fire hose. The process is as follows:
1. Select tweets that (a) contain a question mark ('?') and the bigram do you, is it, does it, or can it; and (b) do not contain wh-words (where, who, why, how, whom, which, when, and whose). Let us name the text between the bigram and '?' a candidate yes-no question.
2. Discard candidate questions that are too short or too long, or that contain named entities or numbers.
3. Keep only questions posted by verified accounts and not containing hashtags or links.
The first step identifies likely yes-no questions. We found that avoiding other questions with wh-words is critical to increase precision. The second step disregards short and long questions. The reason to discard questions with named entities is to minimize subjective opinions and biases during the annotation process; most yes-no questions involving named entities ask for opinions about the entity (usually a celebrity). We discard questions with numbers because the majority are about cryptocurrencies or stocks and we want to avoid having tweets about the financial domain (out of a random sample of 100 questions, 68% of questions with numbers were about cryptocurrencies or stocks). The third step increases precision by considering only verified accounts, which are rarely spam. Avoiding hashtags and links allows us to avoid answers whose interpretation may require external information, including trending topics.
Once yes-no questions are identified, we consider as answers all replies except those that contain links, more than one user mention (@user), no verb, fewer than 6 or more than 30 tokens, or question marks ('?'). The rationale behind these filters is to avoid answers whose interpretation requires external information (links) or that are unresponsive. We do not discard answers with yes or no keywords because, as we shall see (and perhaps unintuitively), they are not straightforward to interpret (Section 4).
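The filtering rules above can be sketched as follows. This is a minimal illustration, not the authors' released code: the question-length thresholds and the named-entity/number checks of step 2 (which need an NER tagger) are omitted, and the verb check on answers is likewise left out.

```python
import re

YN_BIGRAMS = ("do you", "is it", "does it", "can it")
WH_WORDS = {"where", "who", "why", "how", "whom", "which", "when", "whose"}

def is_candidate_question(tweet: str) -> bool:
    """Step 1: keep tweets with a '?' and a yes-no bigram, and no wh-words."""
    text = tweet.lower()
    if "?" not in text or not any(b in text for b in YN_BIGRAMS):
        return False
    return not any(w in WH_WORDS for w in re.findall(r"[a-z']+", text))

def is_candidate_answer(reply: str) -> bool:
    """Answer filters: no links or question marks, at most one @mention,
    and 6-30 tokens (the verb check, which needs a POS tagger, is omitted)."""
    if "http" in reply or "?" in reply or reply.count("@") > 1:
        return False
    return 6 <= len(reply.split()) <= 30
```

Prioritizing precision this way means some genuine yes-no questions are lost, a trade-off the paper acknowledges in the Limitations section.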
Temporal Spans: Old and New We collect questions and their answers from two temporal spans: older (years 2014-2021) and newer (first half of 2022). Our rationale is to conduct experiments in a realistic setting, i.e., training and testing with questions posted in non-overlapping temporal spans.
The retrieval process resulted in many yes-no question-answer pairs. We set out to annotate 4,500 pairs. Because we wanted to annotate all answers to the selected questions across both temporal spans, Twitter-YN consists of 4,442 yes-no question-answer pairs (old: 2,200, new: 2,242). The total number of questions is 1,200 (3.7 answers per question on average).

Interpreting Answers to Yes-No Questions
After collecting yes-no questions and answers, we manually annotate their interpretations. We work with the five interpretations exemplified in Table 1. Our definitions are adapted from previous work (de Marneffe et al., 2009; Louis et al., 2020; Sanagavarapu et al., 2022) and summarized as follows:
• Yes: The answer ought to be interpreted as yes without reservations. In the example, the answer doubles down on the comment that it is rude to suggest that somebody would look better if they lost weight (despite also using not) by characterizing the comment as lethal.
• No: The answer ought to be interpreted as no without reservations. In the example, the author uses irony to effectively state that one should not read books in bars.
• Probably yes: The answer is yes under certain condition(s) (including time constraints), or it shows arguments for yes from a third party. In the example, the answer relies on the weather report to mean that it will be big coat weather.
• Probably no: The answer is no under certain condition(s), shows arguments for no held by a third party, or the author shows hesitancy. In the example, the author indicates that it is not bad to brake with your left foot while driving if you are a racecar driver.
• Unknown: The answer disregards the question (e.g., changes topics) or addresses the question without an inclination toward yes or no. In the example, the author states that the question is irrelevant without answering it.
Appendix A presents the detailed guidelines.

Annotation Process and Quality We recruited four graduate students to participate in the definition of the guidelines and conduct the annotations. The 4,442 yes-no question-answer pairs were divided into batches of 100. Each batch was annotated by two annotators, and disagreements were resolved by an adjudicator. In order to ensure quality, we calculated inter-annotator agreement using linearly weighted Cohen's κ. Overall agreement was κ = 0.68. Note that κ coefficients above 0.6 are considered substantial (Artstein and Poesio, 2008); above 0.8 they are considered nearly perfect. Figure 2 shows the disagreements between the interpretations by annotators and the adjudicator. Few disagreements raise concerns. The percentage of disagreements between yes and no is small: 1.40, 1.44, 1.28, and 0.88. We refer the reader to the Data Statement in Appendix B for further details about Twitter-YN.
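For readers unfamiliar with the metric, the sketch below computes linearly weighted Cohen's κ from scratch on toy annotations. The ordinal ordering of the five labels (yes through no) is our assumption, following the scale described above; linear weighting gives partial credit to near-miss disagreements such as no vs. probably no.

```python
from collections import Counter

def linear_weighted_kappa(a, b, n_labels):
    """Linearly weighted Cohen's kappa for ordinal labels 0..n_labels-1."""
    w = lambda i, j: abs(i - j) / (n_labels - 1)  # disagreement weight
    n = len(a)
    observed = sum(w(x, y) for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)  # label marginals per annotator
    expected = sum(w(i, j) * ca[i] * cb[j] / (n * n)
                   for i in range(n_labels) for j in range(n_labels))
    return 1 - observed / expected

# 5-point ordinal scale: yes=0 .. no=4 (ordering assumed for illustration)
SCALE = {"yes": 0, "probably yes": 1, "unknown": 2, "probably no": 3, "no": 4}
ann1 = [SCALE[l] for l in ["yes", "no", "probably yes", "unknown", "no"]]
ann2 = [SCALE[l] for l in ["yes", "probably no", "probably yes", "unknown", "no"]]
print(round(linear_weighted_kappa(ann1, ann2, 5), 2))  # → 0.88
```

On this toy sample the single no vs. probably no disagreement is penalized only 0.25 of a full disagreement, which is why weighted κ is the appropriate agreement measure for an ordinal label set.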

Analyzing Twitter-YN
Twitter-YN is an unbalanced dataset: the interpretation of 70% of answers is either yes or no. Table 2 presents the label distribution (Columns 1 and 2) and a keyword-based analysis (Columns 3 and 4). Only 13.42% (18.91%) of answers with yes (no) interpretations include a positive (negative) keyword. Interpreting answers thus cannot be reduced to checking for positive or negative keywords: few answers include a keyword, and those that do often do not mean what the keyword suggests. For example, 11.98% of answers labeled yes contain a negative keyword, while only 13.42% contain a positive keyword.
Are Keywords Enough to Interpret Answers? No, they are not. We tried simple keyword-based rules to predict the interpretations of answers. The rules first check for yes (yes, yea, yeah, sure, right, you bet, of course, certainly, definitely, or uh uh), no (no, nope, no way, never, or n't), and uncertainty (maybe, may, sometimes, can, perhaps, or if) keywords. Then they return (a) yes (or no) if we only found keywords for yes (or no), (b) probably yes (or probably no) if we found keywords for uncertainty and yes (or no), and (c) unknown otherwise. As Figure 3 shows, such a strategy does not work. Many instances are wrongly predicted unknown because no keyword is present, and keywords also fail to distinguish between no and probably no interpretations.
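The rules above can be sketched as follows. This is a simplified reimplementation: matching multi-word keywords and the clitic n't as substrings, and single-word keywords as whole tokens, is one of several reasonable matching choices, not necessarily the paper's exact one.

```python
import re

YES_KW = ["yes", "yea", "yeah", "sure", "right", "you bet", "of course",
          "certainly", "definitely", "uh uh"]
NO_KW = ["no", "nope", "no way", "never", "n't"]
UNCERTAIN_KW = ["maybe", "may", "sometimes", "can", "perhaps", "if"]

def has_any(answer: str, keywords) -> bool:
    text = answer.lower()
    tokens = re.findall(r"[a-z']+", text)
    # Multi-word keywords and n't match as substrings; single words as tokens.
    return any(k in text if (" " in k or k == "n't") else k in tokens
               for k in keywords)

def keyword_rule(answer: str) -> str:
    """Return yes/no if only one polarity is found, probably yes/no if an
    uncertainty keyword co-occurs with it, and unknown otherwise."""
    y, n = has_any(answer, YES_KW), has_any(answer, NO_KW)
    uncertain = has_any(answer, UNCERTAIN_KW)
    if y and not n:
        return "probably yes" if uncertain else "yes"
    if n and not y:
        return "probably no" if uncertain else "no"
    return "unknown"
```

As the paper argues, most real answers contain no keyword at all and fall through to unknown, which is the dominant failure mode of this baseline.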

Linguistic Analysis
Although all yes-no questions include one of four bigrams (do you, is it, does it, or can it; Section 3.1), and 97% include do you or is it, Twitter-YN has a wide variety of questions. Indeed, the distribution of verbs has a long tail (Table 3). For example, less than 10% of questions containing do you have the verb think. Further, the top-10 most frequent verbs account for 39.3% of questions starting with do you. Thus, the questions in Twitter-YN are diverse from a lexical point of view.
We also conduct an analysis to shed light on the language used in answers to yes-no questions (Table 4). Verbs and pronouns are identified with spaCy (Honnibal and Montani, 2017). We use WordNet lexical files (Miller, 1995) as verb classes, the negation cue detector by Hossain et al. (2020), and the sentiment lexicon by Crossley et al. (2017).
First, we compare answers interpreted as yes and no. We observe that cardinal numbers are more common with yes than with no interpretations. The object pronoun me is also more common with yes interpretations; most of the time it is used to describe personal experiences. Despite the low performance of the keyword-based rules, negation cues and negative sentiment are more frequent with no interpretations. Finally, emojis are more frequent in humorous answers meaning yes rather than no.
Second, we compare unknown and other interpretations (i.e., when the author leans towards neither yes nor no). We observe that longer answers are less likely to be unknown, while more verbs indicate unknown. All verb classes are indicators of unknown. For example, communication (say, tell, etc.), creation (bake, perform, etc.), motion (walk, fly, etc.), and perception (see, hear, etc.) verbs appear more often in unknown answers. Finally, unknown answers rarely have negation cues.

Automatic Interpretation of Answers to Yes-No Questions
We experiment with simple baselines, RoBERTa-based classifiers, and GPT3 (zero- and few-shot).
Experimental settings Randomly splitting Twitter-YN into train, validation, and test splits would lead to valid but unrealistic results. Indeed, in the real world a model to interpret answers to yes-no questions ought to be trained with (a) answers to different questions than the ones used for evaluation, and (b) answers to questions from non-overlapping temporal spans.
In the main paper, we experiment with the most realistic setting: unmatched question and unmatched temporal span (train: 70%, validation: 15%, test: 15%). We refer the reader to Appendix C for the other two settings: matched question and matched temporal span, and unmatched question and matched temporal span.
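The unmatched protocol can be sketched as follows. This is an illustrative sketch only: the field names and the exact split mechanics are our assumptions, since the paper does not spell them out; the key property is that test instances come from a later period and that no question id crosses the train/validation boundary.

```python
import random

def unmatched_split(pairs, val_frac=0.15, seed=0):
    """Sketch of the 'unmatched question, unmatched temporal span' protocol:
    test answers come from the newer period, and train/validation questions
    from the older period never overlap. Each pair is assumed to be a dict
    with 'question_id' and 'period' ('old' or 'new') keys."""
    test = [p for p in pairs if p["period"] == "new"]
    old = [p for p in pairs if p["period"] == "old"]
    qids = sorted({p["question_id"] for p in old})
    random.Random(seed).shuffle(qids)
    val_qids = set(qids[: max(1, int(len(qids) * val_frac))])
    val = [p for p in old if p["question_id"] in val_qids]
    train = [p for p in old if p["question_id"] not in val_qids]
    return train, val, test
```

Splitting by question id (rather than by answer) is what prevents answers to the same question from leaking across splits.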
Baselines We present three baselines. The majority class always chooses the most frequent label in the training split (i.e., yes). The keyword-based rules are the rules presented in Section 4. We also experiment with a version of the rules in which the lexicon of negation cues is replaced with a negation cue detector (Hossain et al., 2020).

RoBERTa-based Classifiers
We experiment with RoBERTa (Liu et al., 2019) as released by TensorFlow (Abadi et al., 2015) Hub. Hyperparameters were tuned using the train and validation splits, and we report results with the test split. We refer readers to Appendix D for details about hyperparameters and the tuning process.
Pretraining and Blending In addition to training with our corpus, Twitter-YN, we also explore combining related corpora. The most straightforward approach is pretraining with other corpora and then fine-tuning with Twitter-YN. Doing so is sound but obtains worse results (Appendix E) than blending, so we will focus here on the latter.
Pretraining could be seen as a two-step fine-tuning process: first with related corpora and then with Twitter-YN. On the other hand, blending (Shnarch et al., 2018) combines both during the same training step. Blending starts fine-tuning with the combination of both and decreases the portion of instances from related corpora after each epoch by an α ratio. In other words, blending starts tuning with many yes-no questions (from the related corpus and Twitter-YN) and finishes only with the ones in Twitter-YN. The α hyperparameter is tuned like any other hyperparameter; see Appendix D.
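A minimal sketch of the blending schedule follows. This is our reading of the recipe attributed to Shnarch et al. (2018); the shuffling and the exact point at which the related portion is decayed are assumptions.

```python
import random

def blending_schedule(related, target, epochs, alpha=0.5, seed=0):
    """Yield one training set per epoch: all of `target` (Twitter-YN) plus a
    shrinking portion of `related`; each epoch keeps an alpha fraction of the
    previous epoch's related instances."""
    rng = random.Random(seed)
    kept = list(related)
    for epoch in range(epochs):
        batch = kept + list(target)
        rng.shuffle(batch)
        yield epoch, batch
        kept = kept[: int(len(kept) * alpha)]  # decay related portion
```

With 100 related and 10 target instances over 3 epochs at α = 0.5, the epoch sizes are 110, 60, and 35: training starts dominated by the related corpus and ends dominated by Twitter-YN, which is the intuition given above.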
We experiment with pretraining and blending with the following corpora. All of them include yes-no questions and manual annotations indicating the interpretation of their answers. However, none of them are in the Twitter domain:
• BOOLQ (Clark et al., 2019), 16,000 natural yes-no questions submitted to a search engine and (potential) answers from Wikipedia;
• Circa (Louis et al., 2020), 34,000 yes-no questions and answers falling into 10 predefined scenarios and written by crowdworkers;
• SWDA-IA (Sanagavarapu et al., 2022), ≈2,500 yes-no questions from transcriptions of telephone conversations; and
• FRIENDS-QIA (Damgaard et al., 2021), ≈6,000 yes-no questions from scripts of the popular Friends TV show.
Prompt-Derived Knowledge Liu et al. (2022) have shown that integrating knowledge derived from GPT3 in question-answering models is beneficial. We follow a similar approach by prompting GPT3 to generate a disambiguation text given a question-answer pair from Twitter-YN. Then we complement the input to the RoBERTa-based classifier with the disambiguation text. For example, given Q: Do you still trust anything the media says? A: Modern media can make a collective lie to become the truth, GPT3 generates the following disambiguation text: Mentions how media can manipulate the truth, they do not trust anything the media says. We refer the reader to Appendix G for the specific prompt and examples of correct and nonsensical disambiguation texts.
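A sketch of how such disambiguation texts might be requested and folded into the classifier input is below. The prompt wording and the separator token are our own placeholders, not the paper's actual prompt (which is in its Appendix G) or its exact input format.

```python
def disambiguation_prompt(question: str, answer: str) -> str:
    """Build a prompt asking GPT3 for a short disambiguation text
    (placeholder wording; the paper's exact prompt differs)."""
    return (f"Question: {question}\nAnswer: {answer}\n"
            "In one sentence, explain what the answer implies "
            "about the question:")

def classifier_input(question: str, answer: str, disambiguation: str) -> str:
    """Concatenate the generated disambiguation text to the question and
    answer before feeding the RoBERTa-based classifier (separator assumed)."""
    return f"{question} </s> {answer} </s> {disambiguation}"
```

The design point is that the disambiguation text is treated as just another input segment: the classifier is free to use or ignore it, which is consistent with the later finding that even noisy disambiguation texts can help.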

GPT3: Zero-and Few-Shot
Given the success of large language models and prompt engineering in many tasks (Brown et al., 2020), we are keen to see whether they are successful at interpreting answers to yes-no questions from Twitter. We experiment with various prompt-based approaches and GPT3.

Results and Analysis
We present results in the unmatched questions and unmatched temporal span setting in Table 5. Appendix C presents the results in the other two settings. We will focus the discussion on average F1 scores, as Twitter-YN is unbalanced (Section 4).
Regarding the baselines, using a negation cue detector to identify no keywords rather than our predefined list is detrimental (F1: 0.27 vs. 0.37), especially for the yes label. This result leads to the conclusion that many answers with ground truth interpretation yes include negation cues such as the affixes im- and -less, which are identified by the cue detector but are not in our list.

Zero- and Few-Shot GPT3 outperforms the baselines, and obtains moderately lower results than the supervised model trained on Twitter-YN (Question and Answer; F1: 0.53 vs. 0.56). Including the annotation guidelines in the zero- and few-shot prompts obtains worse results (Appendix G, Table 18).
The supervised RoBERTa-based systems trained with other corpora yield mixed results. Circa, SWDA-IA, and Friends-QIA outperform the keyword-based baseline (0.53, 0.47, and 0.51 vs. 0.37); BoolQ does not. We hypothesize that this is because BoolQ includes yes-no questions from formal texts (Wikipedia). Unsurprisingly, training with Twitter-YN yields the best results. The question alone is insufficient (F1: 0.29), but combining it with the answer yields better results than the answer alone (0.56 vs. 0.48).
Friends-QIA and Circa are the only corpora worth blending with Twitter-YN. Both obtain better results than training with only Twitter-YN (F1: 0.58 in both cases vs. 0.56). Finally, including in the input the disambiguation texts automatically generated by GPT3 further improves the results when blending with Circa (F1: 0.61). On the other hand, disambiguation texts are detrimental when blending with Friends-QIA. These results were surprising to us, as disambiguation texts are often nonsensical. Appendix G (Table 17) provides several examples of disambiguation texts generated by GPT3, both sensible and nonsensical.
Regarding statistical significance, it is important to note that blending Circa and using disambiguation texts yields statistically significantly better results than few-shot GPT3. GPT3, however, is by no means useless: it provides a solid few-shot baseline, and the disambiguation texts can be used to strengthen the supervised models.

Table 6: Most frequent errors made by the best performing system in the unmatched question and unmatched temporal span setting (Table 5). Error types may overlap, and P and G stand for prediction and ground truth respectively. In addition to yes and no distractors, social media lexicon (emojis, abbreviations, etc.), humor (including irony and sarcasm), and the presence of (a) conditions or contrasts or (b) external entities are common sources of errors.

Error Analysis
Table 6 illustrates the most common errors made by our best performing model (RoBERTa blending Twitter-YN (with disambiguation texts generated by GPT3) and Circa, Table 5). Note that error types may overlap; for example, a wrongly predicted instance may include both humor and a no distractor.
Yes and no keywords are not good indicators of answer interpretations (Section 4), and they act as distractors. In the first and second examples, the answers include sure and n't, but their interpretations are no and yes respectively. Indeed, we observe distractors in 33% of errors, especially no distractors (28.6%). Many errors (20.7%) contain social media lexicon items such as emojis and hashtags, making their interpretation more difficult. Humor, including irony, is also a challenge (13.2% of errors). In the example, which also includes a contrast, the answer implies that the author likes coffee, as the fruit fly has a positive contribution.
Answers with conditions and contrasts are rarely to be interpreted as unknown, yet the model often does so. Mentioning external entities also poses a challenge, as the model does not have any explicit knowledge about them. Finally, unresponsive answers are always to be interpreted as unknown, yet the model sometimes (11.4% of errors) fails to properly interpret unresponsive answers and mislabels them as either yes or no.

Conclusions
We have presented Twitter-YN, the first corpus of yes-no questions and answers from social media annotated with their interpretations. Importantly, both questions and answers were posted by real users rather than written by crowdworkers on demand. As opposed to traditional question answering, the problem is not to find answers to questions, but rather to interpret answers to yes-no questions.
Our analysis shows that answers to yes-no questions vary widely and rarely include a yes or no keyword. Even when they do, the underlying interpretation is rarely what the keyword suggests. Answers to yes-no questions are usually long explanations that do not state the underlying interpretation (yes, probably yes, unknown, probably no, or no). Experimental results show that the problem is challenging for state-of-the-art models, including large language models such as GPT3.
Our future plans include exploring combinations of neural- and knowledge-based approaches. In particular, we believe that commonsense resources such as ConceptNet (Speer et al., 2017), ATOMIC (Sap et al., 2019), and CommonsenseQA (Talmor et al., 2019) may be helpful. The challenge is to differentiate between common or recommended behavior (or commonly held commonsense) and what somebody answered in social media. Going back to the example in Figure 1, oven manuals and recipes will always instruct people to preheat the oven. The problem is not to obtain a universal ground truth; it does not exist. Instead, the problem is to leverage commonsense and reasoning to reveal nuances about how indirect answers to yes-no questions ought to be interpreted.

Limitations
Twitter-YN, like any other annotated corpus, is not necessarily representative (and it certainly is not a complete picture) of the problem at hand; in our case, interpreting answers to yes-no questions in user-generated content. Twitter-YN may not transfer to other social media platforms. More importantly, our process to select yes-no questions and answers disregards many yes-no questions. In addition to working only with English, we disregard yes-no questions without question marks (e.g., I wonder what y'all think about preheating the oven); the questions we work with are limited to those containing four bigrams. We exclude unverified accounts to avoid spam and maximize the chances of finding answers (verified accounts are more popular), but this choice ignores many yes-no questions. Despite these limitations, we believe that the variety of verbs and the inclusion of two temporal spans alleviate this issue. Regardless of size, no corpus will be complete.
Another potential limitation is the choice of labels. We are inspired by previous work (Sections 2 and 3), some of which works with five labels and others with as many as nine. We found no issues using five labels during the annotation effort, but we certainly have no proof that five is the best choice.
Lastly, we report negative results using several prompts and GPT3, although prompting GPT3 is the best approach if zero or few training examples are available. Previous work on prompt engineering and recent advances in large language models make us believe that researchers specializing in prompts may come up with better prompts (and results) than we did. Our effort on prompting and GPT3, however, is an honest and thorough effort to the best of our ability with the objective of empirically finding the optimal prompts. Since these large models learn even from random and maliciously designed prompts (Webson and Pavlick, 2022), we acknowledge that other prompts may yield better results.

Ethics Statement
While tweets are publicly available, users may not be aware that their public tweets can be used for virtually any purpose. Twitter-YN will comply with the Twitter Terms of Use. Twitter-YN will not include any user information, although we acknowledge that it is easily accessible given the tweet id and basic programming skills. The broad temporal spans we work with minimize the opportunities for monitoring users, but it is still possible.
While it is far from our purposes, models to interpret answers to yes-no questions could be used for monitoring individuals and groups of people. Malicious users could post questions about sensitive topics, including politics, health, and deeply held beliefs. In turn, unsuspecting users may reply with their true opinion, thereby exposing themselves to whatever the malicious users may want to do (e.g., deciding health insurance quotes based on whether somebody enjoys exercising). That said, this hypothetical scenario also opens the door for users to answer whatever they wish, truthful or not, and in the process mislead the hypothetical monitoring system. The work presented here is about interpreting answers to yes-no questions in Twitter, not figuring out the "true" answer. If somebody makes a Twitter persona, we make no attempt to reveal the real person underneath.

A Annotation Guidelines
We give four annotators the following annotation guidelines for each of the five labels.These guidelines are also used in some of the zero-shot and few-shot prompts.

yes (y)
The reply is an affirmative reply to the question without reservations.
Or, the reply states the question in affirmative terms. For example, Mexican food is delicious states Do you like Mexican food? in positive terms.

probably yes (py)
The reply is an affirmative reply under certain conditions.
Or, the reply states the question in hopeful terms. For example, I am keen on eating tacos this weekend states Do you like Mexican food? in hopeful terms.
Or, the reply is affirmative but refers to a point of time in the future. Or, the reply states that an entity (person, group of people, place, organization, etc.) leans towards yes. For example, Q: Do you like Mexican food? A: My whole family loves it.

no (n)
The reply is a negative reply to the question without reservations.
Or, the reply states the question in negative terms.For example, Mexican food is bland states Do you like Mexican food? in negative terms.

probably no (pn)
The reply is a negative reply under certain conditions.
Or, the reply states that an entity (person, group of people, place, organization, etc.) leans towards no. For example, Q: Do you like Mexican food? A: My friends avoid Mexican restaurants.

unknown (uk)
A reply that is responsive to the question but does not lean towards yes or no.
Or, a reply that does not address the question at all.

B Data Statement
As per the recommendations by Bender and Friedman (2018), we provide here a Data Statement for better understanding of Twitter-YN.

B.1 Curation Rationale
We develop Twitter-YN to build and evaluate models to interpret answers to yes-no questions in user-generated content. Twitter-YN includes yes-no questions and answers from Twitter along with annotations indicating the interpretation of the answers.
Questions come from main tweets (i.e., any tweet except replies and retweets). Answers are reply tweets to the question tweets (retweets do not count as replies). Questions and answers are selected based on manually defined rules. These rules are developed and refined through an iterative process. We scrape Twitter for question-answer pairs and assign 100 instances to four annotators (each annotator gets the same instances). Based on (a) feedback by each of these four annotators on the questions and answers and (b) agreement scores (linearly weighted Cohen's κ), the rules are refined.
A new set of 100 random instances is then collected based on the refined rules.This is once again given for annotation to the four annotators.This process continues until annotators find that the questions and answers identified with the rules are actual yes-no questions and answers to the questions.
The final version of the rules used to scrape yes-no questions and answers from Twitter is detailed in Section 3.1. Section 3.2 provides descriptions of the five interpretations annotators choose from (i.e., annotation guidelines), and Table 1 presents examples. All question-answer pairs (4,442) were annotated by two annotators independently. Four annotators participated in the annotations. Inter-annotator agreement (linearly weighted Cohen's κ) is 0.68, indicating substantial agreement (Artstein and Poesio, 2008); over 0.8 would be nearly perfect. After the independent double annotation process, the adjudicator resolves the disagreements to create the ground truth.

B.2 Language Variety
The data collection process was carried out from May to July 2022. Question tweets are in English ('en' as per the Twitter API). We also use SpaCy to confirm that the reply is in English. Information on the specific type of English (American, British, Australian, etc.) is not available.

B.3 Speaker Demographic
The questions come from 1,200 unique question tweets posted by verified Twitter accounts (as of July 2022). We look specifically at verified users only, since their tweets are more popular and are therefore more likely to have replies. We do not require replies to come from verified accounts. As per Twitter age restrictions, the minimum user age is 13 years for authors of both questions and answers. Speakers are not reachable in our scenario, so demographic information is limited.

B.4 Annotator Demographic
Four annotators (including one adjudicator) are part of the annotation process and the development of the annotation guidelines. All of them are graduate students who are fluent in English. The annotators are three men and one woman, and their ages range from 18 to 40 years old. Ethnic backgrounds are as follows: one Asian and three South Asian. Annotators reported that they are from a middle-class economic background.
The final version of the annotation guidelines is developed by taking into account annotator feedback during pilot annotations. There is no overlap between the questions and answers used in the pilot annotations and those in Twitter-YN.

B.5 Speaker Situation
Text in Twitter-YN is retrieved from Twitter between May and July of 2022. The modality of the text is written (typed by the speaker). Twitter allows users (speakers) to edit what they tweet; we use the version of the tweet available as of July 2022 regardless of the original posting date. Twitter-YN includes asynchronous interactions, since reply tweets are posted only after the tweet with the question is posted. Questions and answers cannot appear on Twitter simultaneously. The intended audience for the text could be anyone on the Internet.

B.6 Text Characteristics
The genre of the text is social media. Question-answer instances generally cover topics such as general advice, day-to-day chores, health, food, finance, music, policies, and politics. The text is informal. Social media cues such as shortened phrases (e.g., lol: laughing out loud), slang, and emojis are common. Most tweets are text only (including Unicode symbols such as emojis), although a few also contain images. Twitter-YN does not include these images.

Since the test splits are not the same, the results in Tables 7 and 8 are not comparable. For the same reason, the results in Table 5 and either Table 7 or 8 cannot be compared either.

D Hyperparameters
All RoBERTa-based classifiers were tuned using the train and validation splits. Results (Tables 5, 7-8, and 12-14) were obtained with the test split after the tuning process. We experimented with the following hyperparameter ranges, choosing the optimal values based on the loss calculated with the validation split:
• Learning rate: 1e-5, 2e-5, 3e-5
• Epochs: up to 200 with early stopping (patience = 3, min_delta = 0.01)
• Batch size: 16, 32
• Blending factor (α): 0.2-0.8
Tables 9-11 present the tuned hyperparameters in the three experimental settings. These tables complement the results presented in Tables 5, 7, and 8.
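The early-stopping criterion above can be sketched as follows. This is a minimal illustration of how patience and min_delta interact on the validation loss, not our actual training code.

```python
class EarlyStopper:
    """Stop training when validation loss stops improving.

    Improvement means the loss drops below the best seen value by
    more than min_delta; after `patience` epochs without improvement,
    training stops.
    """

    def __init__(self, patience=3, min_delta=0.01):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True to stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

For example, with the defaults, a run whose validation loss plateaus at 0.95 after an initial drop stops after three non-improving epochs.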

E Results Pretraining with Related Corpora
Tables 12-14 present results when pretraining with yes-no questions and then fine-tuning with Twitter-YN, rather than blending. These tables complement Tables 5, 7, and 8. Although pretraining sometimes outperforms blending in two of the settings, blending always outperforms pretraining in the most important setting: unmatched question and unmatched temporal span.
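As a rough illustration of the difference, one common formulation of blending trains on all of the target corpus plus a portion of the auxiliary corpus that shrinks by the blending factor α each epoch, whereas pretraining consumes the auxiliary corpus entirely before the target corpus is seen. The sketch below follows that blending schedule; it is an assumption for illustration, not code from our experiments.

```python
def blended_epochs(target, auxiliary, alpha, epochs):
    """Yield the training data for each epoch under blending.

    Every epoch uses all of `target`; the auxiliary portion starts
    at 100% and decays by the factor `alpha` each epoch.
    """
    frac = 1.0
    for _ in range(epochs):
        k = int(len(auxiliary) * frac)
        yield target + auxiliary[:k]
        frac *= alpha
```

With α = 0.5, a 4-instance auxiliary corpus contributes 4, then 2, then 1 instances over three epochs, so the model gradually shifts toward the target distribution.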

F Detailed Error Analysis
Tables 15 and 16 detail the error types exemplified in Table 6.Specifically, we provide the percentages of errors for each combination of ground truth (gold) and predicted interpretations.

G Experiments Using Prompts and GPT3
Given the success of large language models and prompt engineering in many tasks (Brown et al., 2020), we are keen to see whether they are successful at interpreting answers to yes-no questions from Twitter. We experiment with various prompt-based approaches and GPT3. Specifically,

Figure 1 :
Figure 1: Yes-No question from Twitter and three answers. Interpreting answers cannot be reduced to finding keywords (e.g., yes, of course). The answers ought to be interpreted as yes (despite the negation), no (commonsense and reasoning are needed), and unknown, respectively.

Figure 2 :
Figure 2: Differences between adjudicated interpretations and interpretations by each annotator. There are few disagreements; the diagonals (i.e., the adjudicated interpretation matches the one by each annotator) account for 84.72% and 84.37%. Additionally, most disagreements are minor: they are between (a) yes (no) and probably yes (probably no), or (b) unknown and another label.

Figure 3 :
Figure 3: Heatmap comparing the interpretations predicted with keyword-based rules (Section 4) and the ground truth. Rules are insufficient for this problem.
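As an illustration of why such rules fall short, a naive keyword baseline might look like the sketch below. The keyword lists are hypothetical; the actual rules are described in Section 4.

```python
def keyword_rule(answer):
    """Naive keyword baseline for interpreting an answer (sketch)."""
    # strip surrounding punctuation so "Yes," still matches
    words = [w.strip(".,!?") for w in answer.lower().split()]
    if any(w in words for w in ("yes", "yeah", "yep", "course")):
        return "yes"
    if any(w in words for w in ("no", "nope", "nah", "never")):
        return "no"
    return "unknown"
```

A rule like this misfires exactly on the cases highlighted in Figure 1: an answer with a negation can still mean yes, and an answer containing both keywords may commit to neither.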
Tables 7 and 8 present results with the unmatched question and matched temporal span and the matched question and matched temporal span settings. These tables complement the results discussed in Section 5. While valid, these settings are unrealistic in terms of what the model is trained and evaluated with. In the first setting, models are trained with question-answer pairs posted during the same temporal span as those in the test split. In the second setting, models are evaluated with answers to the same questions seen during training.

Table 1 :
Examples of yes-no questions and answers from our corpus. Some answers include negative keywords but their interpretation is yes (first example), whereas others include positive keywords but their interpretation is no (second example). Answers imposing conditions are interpreted as probably yes or probably no (third and fourth examples). Negation does not necessarily indicate that the author leans toward either yes or no (fifth example).

2. Exclude candidate questions unless they (a) are between 3 and 100 tokens long; (b) do not span more than one paragraph; and (c) do not contain named entities or numbers.
3. Exclude candidate questions with links, hashtags, or from unverified accounts.
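The exclusion rules above can be approximated in code. This is a sketch: the paragraph, number, link, and hashtag checks are simple heuristics, and the named-entity check (which requires an NER model) is omitted.

```python
import re

def keep_question(text, verified):
    """Apply the candidate-question filters (sketch of rules 2-3)."""
    tokens = text.split()
    if not (3 <= len(tokens) <= 100):        # rule 2a: length bounds
        return False
    if "\n\n" in text:                       # rule 2b: single paragraph only
        return False
    if re.search(r"\d", text):               # rule 2c: numbers (NER omitted)
        return False
    if re.search(r"https?://|#\w+", text):   # rule 3: links or hashtags
        return False
    return verified                          # rule 3: verified accounts only
```

In practice such filters are applied to tweets already identified as question candidates; the sketch only shows the exclusion logic.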

Table 2). More interestingly, few answers contain a yes or no

Table 2 :
Distribution of interpretations in our corpus (Columns 1

Table 3 :
Most common bigrams and verbs in the questions. Most questions include the bigrams do you or is it (97%). The verb percentages show that the questions are diverse. Indeed, the top-10 most frequent verbs with do you only account for 39.3% of questions with do you.

Table 4 :
Linguistic analysis comparing answers interpreted as (a) yes and no and (b) unknown and the other labels. The number of arrows indicates the p-value (t-test; one: p<0.05, two: p<0.01, and three: p<0.001). The arrow direction indicates whether higher counts correlate with the first interpretation (yes or unknown) or the second.

Table 5 :
Results in the unmatched question and unmatched temporal span setting. Training only with other corpora obtains modest results, and taking into account the question in addition to the answer is beneficial (Twitter-YN, F1: 0.56 vs. 0.48). Blending Circa and Friends-QIA is beneficial (F1: 0.58 vs. 0.56). Adding the disambiguation text from GPT3 with Circa is also beneficial (F1: 0.61). A few-shot prompt with GPT3 obtains results close to a supervised model trained with Twitter-YN (F1: 0.53 vs. 0.56), but underperforms blending Circa using the disambiguation text from GPT3 (0.61). Statistical significance with respect to training with Twitter-YN (Q and A) is indicated with '*', and with respect to few-shot GPT3 with '†' (McNemar's test (McNemar, 1947), p<0.05).


Table 8 :
Results with the test split of Twitter-YN (our corpus) in the matched question and matched temporal span setting. We present results with (a) baselines, (b) training with yes-no question corpora (other corpora and the training split of Twitter-YN), and (c) blending the training split of Twitter-YN and other corpora. While a direct comparison is unsound because the training and test splits differ, we generally observe results that are the same or higher than those in the unmatched question and unmatched temporal span setting (Table 5). Unsurprisingly, using (different) answers to the same questions in training and testing is beneficial. Note, however, that these results are unrealistic.

Table 9 :
Hyperparameters for the unmatched question and unmatched temporal span setting, found after tuning with the development set. This table complements Table 5.

Table 10 :
Hyperparameters for the unmatched question and matched temporal span setting, found after tuning with the development set. This table complements Table 7.

Table 11 :
Hyperparameters for the matched question and matched temporal span setting, found after tuning with the development set. This table complements Table 8.