Answer Quality Aware Aggregation for Extractive QA Crowdsourcing



Introduction
Extractive question answering (EQA) is a fundamental task in natural language processing. With access to large-scale datasets, deep neural models have achieved significant advances on the EQA task (Lewis et al., 2019; Devlin et al., 2018; Zhang et al., 2020). Creating large-scale, high-quality datasets is one of the essential factors driving this progress (Rogers et al., 2021). Currently, a prevalent method for creating EQA datasets is crowdsourcing (Rajpurkar et al., 2016, 2018; Trischler et al., 2016; Yang et al., 2018; Talmor et al., 2018), thanks to its efficiency and the scalability afforded by the availability of crowd workers. Yet, answers collected from crowd workers often

[Figure 1 shows a CNN news passage about Sonia Sotomayor, the question "What did the GOP leaders say?", and three crowd answers with their vote counts and agreement measures: "Newt Gingrich called Sotomayor a racist" (0 votes, 0.3433), "he wants more than an explanation" (0 votes, 0.3118), and "they were discriminated against after a promotion test was thrown out, because critics said it discriminated against minority firefighters" (2 votes, 0.5564).]

Figure 1: An example of answer aggregation for QA crowdsourcing. In this example, three crowd workers are asked to select a word span in the passage as the answer to the question. The gold answer can be aggregated from the disagreeing answers by asking another group of workers to select an answer (vote) or by using answer aggregation models (agreement measure).
contain a substantial amount of noise, owing to the varying reliability of crowd workers, which is affected by their expertise, skills, and motivation (Kazai et al., 2011; Geva et al., 2019).
To reduce noise in crowdsourced data, a widely adopted solution in previous crowdsourcing research is to assign each instance to multiple crowd workers to create redundant annotations (Trischler et al., 2016; Yang et al., 2018; Talmor et al., 2018). Aggregation across the answers provided by different crowd workers thus becomes a primary focus when crowdsourcing EQA datasets. Majority voting is a simple and widely adopted aggregation method (Zheng et al., 2017) that elects the answer most crowd workers agree with. However, most majority-voting-based methods target categorical labels, where the label space is small enough that workers are likely to produce the same label (Passonneau and Carpenter, 2014; Lakkaraju et al., 2015; Nguyen et al., 2017; Zhang et al., 2021a). They cannot be applied to the EQA task, where answer candidates are word spans rather than a limited number of categorical labels, because of the huge number of possible spans. There are some methods for automatically aggregating text sequences (Li, 2020; Li and Fukumoto, 2019), but they apply only to free-text tasks such as translation. Unlike in free-text tasks, EQA answer candidates are word spans within context passages, and their quality depends on both the question and the context passage; previous methods do not consider these dependencies. Therefore, answer aggregation for EQA is commonly performed by having a second group of workers select and verify answers (Trischler et al., 2016; Welbl et al., 2017).

Figure 2: System overview and an example of automatic answer aggregation. Crowd workers are asked to label answer spans in passages for the given questions. If they achieve consensus, the QA pairs are used to fine-tune the natural language inference (NLI) based answer correctness evaluation model and the question answering (QA) model. Then we sort the non-consensus answers based on their encoding using a pre-trained language model (PLM), the answer correctness (β_{i,k}) and the question answering confidence (γ_{i,k}).
As the example in Figure 2 shows, crowd workers provide three distinct answer spans for the same instance. Another three crowd workers are then asked to vote on each answer annotation; Answer 3 receives 2 votes and is selected as the ground-truth answer for the question. This method requires additional resources and human effort.
In this paper, we first model candidate answer aggregation as a text sequence aggregation problem (Li and Fukumoto, 2019). Previous methods aggregate the best answer based on inter-answer distances between vector representations. As answers for EQA are word spans within context passages, we adapt previous methods by representing answers with contextual embeddings produced by pre-trained language models (Wolf et al., 2020). In previous research, answer quality is evaluated by estimating worker reliability. We argue that in EQA, answer quality can also be evaluated based on the answer's relation to the context passage and the question. We investigate answer quality evaluation both from the view of question answering (the Answer Confidence measure), using QA models, and from the view of answer verification (the Answer Correctness measure), using natural language inference (NLI) models. We further propose a novel joint framework that incorporates the answer quality measures into inter-answer-distance-based answer aggregation for EQA.
With this work we make the following contributions: • We propose a simple yet effective novel framework for aggregating crowdsourced answer annotations for EQA.
• We explore two answer quality measures, Answer Confidence and Answer Correctness, using weak heuristic question answering signals and NLI models, and illustrate their effectiveness.
• Comprehensive experiments on a real large-scale crowdsourced QA dataset demonstrate the effectiveness of the proposed answer quality measures and answer aggregation methods.
The results show that our framework can effectively leverage the rich information in context passages, questions and answer candidates for answer aggregation, achieving an improvement of around 15% in precision over baseline methods.

Crowdsourcing for QA Dataset Creation
Quality control in crowdsourcing has attracted intensive research (Snow et al., 2008; Kazai et al., 2011; Yang et al., 2019; Geva et al., 2019; Sayin et al., 2021). To reduce the noise in crowdsourced data, each data instance is commonly assigned to multiple workers to create redundant annotations, from which the hidden ground truth is inferred by aggregation (Trischler et al., 2016; Yang et al., 2018; Talmor et al., 2018). In contrast to classification or categorical crowdsourcing tasks (Sun et al., 2014; Nguyen et al., 2017; Zhang et al., 2021a; Simpson et al., 2020; Lin et al., 2021), which have small label spaces, it is harder for crowd workers to reach consensus on the answer to the same question. What signals the disagreement contains and how to use them effectively is an interesting research question (Aroyo and Welty, 2015; Northcutt et al., 2021); most existing work on this question focuses on classification problems. Some work (Min et al., 2019; Chen et al., 2022) found that noisy answers can be used as weak supervision signals to improve QA performance, especially in low-resource domains; however, such work still relies on ground-truth answers obtained by crowdsourcing. In practice, multi-stage methods are commonly adopted for answer aggregation in QA (Trischler et al., 2016; Welbl et al., 2017; Kwiatkowski et al., 2019). For example, a four-stage collection process was used for NewsQA (Trischler et al., 2016): each item is assigned to multiple crowd workers (avg. 2.73) for answer annotation, and then another group (avg. group size 2.48) validates the distinct answer annotations collected in the previous stage. The Google Natural Questions dataset (Kwiatkowski et al., 2019) evaluates non-null answer correctness with consensus judgments from 4 "experts" and k-way annotations (with k = 25) on a subset. These approaches incur additional costs in human effort, time, and money.

Crowdsourced Text Sequence Aggregation
Majority voting is the most common and simplest aggregation method. It assumes that most workers have comparable accuracy and reliability on the task, so some workers will produce the same answer for the same question, especially for categorical label tasks with small label spaces. However, it can perform poorly on complex sequence labeling tasks such as translation, summarization, and question answering: the space of possible word sequences is so large that workers rarely produce identical answers from which a ground truth could be elected. Therefore, multi-stage crowdsourcing patterns are used to resolve disagreements by selecting, verifying, or correcting answers, as in the methods discussed in the previous subsection. Several automatic methods have been proposed to reduce this human labor. Li and Fukumoto (2019) and Li (2020) converted answer texts into embeddings and extracted the potentially optimal answer by estimating the embedding of the true answer, considering both worker reliability and sequence representation. Braylan and Lease (2020) proposed a single, general annotation and aggregation model that supports diverse tasks such as translation and sequence labeling by modeling label distances. Braylan and Lease (2021) proposed aggregating complex annotations, such as sequence labels and multi-object image annotations, by matching and merging different labels. Although these methods have achieved substantial advances in complex answer aggregation, little research has focused on question answering crowdsourcing.

Problem Definition
For the extractive answer labeling task, each instance D_i assigned to crowd workers is a tuple containing a context passage P_i and a question Q_i, i.e., D_i = (P_i, Q_i). Worker k is asked to select a word span A_{i,k} = (A^s_{i,k}, A^e_{i,k}) from the context passage, where A^s_{i,k} and A^e_{i,k} indicate the start and end positions of the answer in the passage, or NULL if no answer is present in the passage. We thus obtain a set of answers A_i = {A_{i,1}, ..., A_{i,K}} for question Q_i. The answer aggregation model aims to select one answer from A_i as the golden answer or to reject all answers. In this work, we focus on designing an effective automatic answer aggregation model to reduce the human labor of multi-stage answer selection and verification, especially when none of the answers agree with each other. We achieve this by producing a ranked list of all answers, so that the answers with the highest evaluation scores are ranked first.
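The task setup above can be sketched with plain data structures. This is a minimal illustration only; the class and field names are our own, not from the paper:

```python
from dataclasses import dataclass
from typing import Optional, List

@dataclass
class Instance:
    passage: str   # context passage P_i
    question: str  # question Q_i

@dataclass
class AnswerSpan:
    start: int      # A^s_{i,k}: start token index in the passage
    end: int        # A^e_{i,k}: end token index (inclusive)
    worker_id: int  # k, the worker who produced the span

# A NULL annotation (no answer present in the passage) is represented as None.
Annotation = Optional[AnswerSpan]

def aggregate(instance: Instance, answers: List[Annotation], score) -> List[Annotation]:
    """Rank the collected answers so the highest-scoring candidate comes first."""
    return sorted(answers, key=lambda a: score(instance, a), reverse=True)

# Toy usage: score by span length (a stand-in for the paper's quality measures).
inst = Instance("the cat sat on the mat", "where did the cat sit?")
anns = [AnswerSpan(4, 5, 0), AnswerSpan(5, 5, 1), None]
ranked = aggregate(inst, anns,
                   lambda i, a: 0 if a is None else a.end - a.start + 1)
```

The aggregation methods in the following sections differ only in the `score` function they supply.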

Text Sequence Aggregation for Answer Aggregation
Since answers are word spans extracted from context passages, we first model the answer aggregation problem as a free text sequence aggregation problem and adopt the free text sequence aggregation methods Sequence Majority Voting (SMV) and Sequence Maximum Similarity (SMS) (Li and Fukumoto, 2019). These methods perform text sequence aggregation based on the answers' vector representations.
Answer Representation Unlike free text sequence aggregation problems such as translation, answer correctness depends not only on the answer word span but also on its context. Therefore, to produce a single vector representation of each answer, instead of encoding the answer independently, we obtain the answer's contextual embedding by encoding the passage containing the answer with a transformers-based pre-trained language model, and use the mean of the answer token embeddings as the answer embedding. Formally, we write the passage as a sequence of tokens P_i = {p_j}_{j=1}^{|P_i|} (with |P_i| the length of the passage and p_j its tokens), the language model as E, and the token-wise encoding as:

{p_1, ..., p_{|P_i|}} = E(P_i)

The answer representation â_{i,k} is then produced by:

â_{i,k} = mean({p_{A^s_{i,k}}, ..., p_{A^e_{i,k}}})

Sequence Majority Voting (SMV) (Li and Fukumoto, 2019) is the direct adaptation of majority voting to the sequence label problem. SMV estimates the true answer embedding ê_i as the mean of all answer representations:

ê_i = mean({â_{i,1}, ..., â_{i,K}})

and ranks answer candidates by their similarity to ê_i, extracting the golden answer ẑ_i as the candidate with the highest semantic similarity to ê_i:

ẑ_i = argmax_{A_{i,k}} cos(â_{i,k}, ê_i)

Sequence Maximum Similarity (SMS) (Li and Fukumoto, 2019) was first proposed for unsupervised ensembling of the outputs of multiple text generation models (Kobayashi, 2018). It selects a majority-like output close to the other outputs under cosine similarity, which approximates finding the maximum-density point by kernel density estimation. Li and Fukumoto (2019) adopted SMS for crowdsourced translation data, which is generated by crowd workers rather than text generation models, but only for free text sequences. In this paper, we further adapt it to the extractive QA task. We produce answer representations as above and extract the golden answer ẑ_i as the candidate with the largest summed similarity s_{i,k} to the other answer annotations of the same question:

s_{i,k} = Σ_{k'≠k} cos(â_{i,k}, â_{i,k'}),   ẑ_i = argmax_{A_{i,k}} s_{i,k}
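Both estimators reduce to a few lines of linear algebra once each answer has a contextual vector â_{i,k}. A minimal sketch (NumPy only; the embeddings here are toy vectors standing in for real PLM outputs):

```python
import numpy as np

def cos_sim(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def smv(answer_embs):
    """Sequence Majority Voting: pick the answer closest to the mean embedding."""
    e_hat = np.mean(answer_embs, axis=0)  # estimated true-answer embedding
    scores = [cos_sim(a, e_hat) for a in answer_embs]
    return int(np.argmax(scores)), scores

def sms(answer_embs):
    """Sequence Maximum Similarity: pick the answer with the largest summed
    similarity to the other answers (a kernel-density-style majority)."""
    n = len(answer_embs)
    scores = [sum(cos_sim(answer_embs[k], answer_embs[j])
                  for j in range(n) if j != k) for k in range(n)]
    return int(np.argmax(scores)), scores

# Two nearly identical answers and one outlier: both methods prefer the pair.
embs = np.array([[1.0, 0.1], [1.0, 0.0], [-1.0, 0.5]])
```

In both cases the returned scores can also serve directly as the ranking scores used for evaluation.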

Answer Quality Aware Answer Aggregation
The answer representations capture contextual information only, but the quality of each answer also depends on whether it actually answers the question given the context passage. Answer text sequence aggregation methods therefore cannot fully utilize the rich information in the context and the question. We propose to aggregate crowdsourced answers in an answer quality aware way. We first evaluate answer quality from the view of a question answering model (Answer Confidence) and from the view of an answer verification model (Answer Correctness). Due to the lack of labeled data for training the QA and NLI models, their predictions are noisy and inaccurate; nevertheless, they can still provide hints on answer quality. We then propose a novel aggregation method that strengthens the influence of likely high-quality answers (ACAF-SMS/SMV).

Answer Confidence (AF) We use BERT-QA (Devlin et al., 2018) as our QA model. It consists of two parts: the BERT encoder and the answer classifier. The answer classifier predicts the distributions of the start and end positions separately from the outputs of the BERT encoder. As argued by Xie et al. (2020) and Zhu and Hauff (2021), a QA model should be confident about its start/end span prediction for an answerable question; thus the prediction probability distributions should peak at both A^s_{i,k} and A^e_{i,k}. Therefore, the geometric average of the start position probability Pr_s(s|P_i, Q_i) and the end position probability Pr_e(e|P_i, Q_i) can be used as a heuristic for the confidence of the answer prediction. Formally, we define the answer confidence γ_{i,k} as:

γ_{i,k} = max_{|s − A^s_{i,k}| ≤ w, |e − A^e_{i,k}| ≤ w} ( Pr_s(s|P_i, Q_i) · Pr_e(e|P_i, Q_i) )^{1/2}

where w is the search window size.
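The confidence heuristic is the geometric mean of the QA model's start and end probabilities. The sketch below assumes the maximum is taken over a ±w token window around the worker-annotated span, which is one plausible reading of the search window size w, not a verbatim transcription:

```python
import numpy as np

def answer_confidence(p_start, p_end, a_s, a_e, w=2):
    """Heuristic answer confidence gamma_{i,k}: geometric mean of the QA
    model's start/end probabilities, maximized over a +/- w token window
    around the worker-annotated span (the windowed max is our assumption)."""
    n = len(p_start)
    best = 0.0
    for s in range(max(0, a_s - w), min(n, a_s + w + 1)):
        for e in range(max(0, a_e - w), min(n, a_e + w + 1)):
            if s <= e:
                best = max(best, float(np.sqrt(p_start[s] * p_end[e])))
    return best

# Toy distributions that peak exactly at the annotated span (tokens 3..4).
p_s = np.array([0.05, 0.05, 0.1, 0.6, 0.1, 0.1])
p_e = np.array([0.05, 0.05, 0.05, 0.15, 0.6, 0.1])
gamma = answer_confidence(p_s, p_e, a_s=3, a_e=4)
```

In practice `p_start` and `p_end` would come from softmaxing the start/end logits of a BERT-QA forward pass.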
Answer Correctness (AC) QA models often lack the ability to verify the correctness of a predicted answer (Chen et al., 2021). One way to address this issue is to reformulate verification as a textual entailment problem (Harabagiu and Hickl, 2006; Richardson et al., 2013; Chen et al., 2021), viewing the answer context as the premise and the QA pair as the hypothesis. We then use a natural language inference (NLI) system to verify whether a candidate answer proposed by crowd workers satisfies the entailment criterion. We use a transformers-based pre-trained sequence classification model for answer correctness verification. We treat the answer candidate as a short text sequence (answer-text) and format the model input as "[CLS] question [SEP] passage [SEP] answer-text [SEP]". For passages longer than the maximum of 512 tokens, we truncate and keep only the sentences containing the answer span. The embedding of the [CLS] token is used as the pooled encoding of the sequence, and a linear classification layer is applied to this encoding. Finally, we apply the softmax function to obtain the probability that an answer candidate is correct given the passage.
Formally, β_{i,k} = V(P_i, Q_i, A_{i,k}), where V is the NLI model used to verify answer correctness and β_{i,k} is the probability that the answer A_{i,k} to question Q_i is correct.
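The verification step amounts to formatting one sequence-pair input and reading off a softmax probability. A sketch of the plumbing (the tokenizer and classifier are stubbed out; real use would load a Huggingface sequence classification model, and the "correct" class index here is an assumption that depends on the trained model's label order):

```python
import numpy as np

def build_nli_input(question, passage_sentences, answer_text, answer_sent_idx):
    """Format '[CLS] question [SEP] passage [SEP] answer-text [SEP]', keeping
    only the sentence containing the answer span for long passages."""
    passage = " ".join(passage_sentences[answer_sent_idx:answer_sent_idx + 1])
    return f"[CLS] {question} [SEP] {passage} [SEP] {answer_text} [SEP]"

def softmax(logits):
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

def answer_correctness(logits):
    """beta_{i,k}: probability of the 'correct/entailed' class (index 1 here,
    an assumption about the classifier's label order)."""
    return float(softmax(np.asarray(logits, dtype=float))[1])

text = build_nli_input("What did the GOP leaders say?",
                       ["Irrelevant sentence.", "He wants more than an explanation."],
                       "he wants more than an explanation", answer_sent_idx=1)
beta = answer_correctness([-1.2, 2.3])  # logits from a stubbed classifier
```

With a real model, `text` would be tokenized and the logits taken from the classification head's output.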
We then propose to combine the answer confidence and the answer correctness probability for answer quality evaluation. Assuming these two measures are complementary, and to keep the method simple, we combine them as a simple sum:

r_{i,k} = β_{i,k} + γ_{i,k}   (6)

The Joint Method (ACAF-SMS/SMV)
We propose to join the NLI model, the QA model and contextual answer vector representations for answer aggregation by incorporating the answer correctness probability and answer confidence into the sequence aggregation methods SMV and SMS, further strengthening the influence of high-quality answers. The joint sequence majority voting (ACAF-SMV) method computes the answer aggregation measure s_{i,k} as:

s_{i,k} = r_{i,k} · cos(â_{i,k}, ê_i)

and the joint sequence maximum similarity (ACAF-SMS) method as:

s_{i,k} = r_{i,k} · Σ_{k'≠k} cos(â_{i,k}, â_{i,k'})

The AC-SMS/SMV and AF-SMS/SMV algorithms are obtained analogously by replacing the combined score r_{i,k} with the answer correctness probability β_{i,k} or the answer confidence γ_{i,k} alone. Figure 2 illustrates the proposed method.
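The joint scoring can be sketched as a quality-weighted version of the similarity-based rankers. The multiplicative weighting below matches the stated intent of strengthening high-quality answers, but is our reading rather than a verbatim transcription of the paper's formulas:

```python
import numpy as np

def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def acaf_smv(answer_embs, quality):
    """Quality-weighted SMV: similarity to the mean embedding, scaled by
    r_{i,k} = beta_{i,k} + gamma_{i,k} (weighting scheme is our assumption)."""
    e_hat = np.mean(answer_embs, axis=0)
    return [q * cos_sim(a, e_hat) for a, q in zip(answer_embs, quality)]

def acaf_sms(answer_embs, quality):
    """Quality-weighted SMS: summed pairwise similarity, scaled by r_{i,k}."""
    n = len(answer_embs)
    return [quality[k] * sum(cos_sim(answer_embs[k], answer_embs[j])
                             for j in range(n) if j != k) for k in range(n)]

embs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
r = [0.2, 1.5, 0.4]  # quality lifts answer 1 over the similarly central answer 0
best = int(np.argmax(acaf_sms(embs, r)))
```

Passing a vector of ones for `quality` recovers plain SMV/SMS, which makes the quality-aware variants easy to ablate.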
4 Experimental Setup

Dataset
We evaluate the proposed method on the NewsQA dataset because it provides all crowdsourced raw answer annotations. The creation process of NewsQA demonstrates the challenges of QA dataset crowdsourcing and the importance and necessity of answer aggregation. Answers in NewsQA are collected through a two-stage process: the primary stage (answer sourcing) and the validation stage. In the primary stage, each question solicits answers from 2.73 crowdworkers on average. 56.8% of questions reach consensus (at least two matching answers) in the primary stage, and a further 37.8% obtain consensus answers after the validation stage. Crowdworkers do not come to a consensus for the remaining 5.3% of questions.
In this paper, we split NewsQA into four subsets: the primary consensus (Primary-C) set, containing all passages, questions and answers from the training set that reach agreement in the primary stage; the primary non-consensus (Primary-NC) set, containing all passages, questions and answer candidates from the training set that reach agreement only after an additional round of answer validation; the test consensus (Test-C) set, containing passages, questions and answers from the test set that reach consensus; and the test non-consensus (Test-NC) set, containing test-set items that reach consensus only after an additional round of answer validation. Figure 3 shows a boxplot of the number of crowdsourced answers per question; the non-consensus sets have more than four distinct answers per question. The Primary-C and Test-C sets contain gold answers and are used for training and evaluating the NLI and QA models used for answer aggregation. The Primary-NC and Test-NC sets are used for evaluating the proposed method. Passages in the training set do not appear in the test set, so our evaluation measures generalization.
Table 1 shows the statistics of our data.

Baselines
Random Selection (RS) This baseline ranks answer annotations randomly for each question. We report RS performance as the average over five random trials.
Context-Free (CF) SMS/SMV This baseline produces answer representations by treating answers as free text sequences without considering the context passages, i.e., the original SMS/SMV methods proposed by Li and Fukumoto (2019).

Evaluation
For each question, we sort the answers by the proposed aggregation methods. We evaluate the results in terms of widely used rank-aware metrics, including Precision@1 (P@1), Recall@1 (R@1), Mean Average Precision (MAP) and normalized discounted cumulative gain (NDCG), using the implementations in the information retrieval evaluation toolkit pytrec_eval (Van Gysel and de Rijke, 2018).
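For this single-gold-answer setting, P@1 and MAP have simple closed forms; the self-contained sketch below recomputes them for illustration (pytrec_eval provides the same metrics in practice):

```python
def precision_at_1(ranked, relevant):
    """P@1: is the top-ranked answer one of the gold answers?"""
    return 1.0 if ranked[0] in relevant else 0.0

def average_precision(ranked, relevant):
    """AP over one ranked list; MAP is the mean of AP over all questions."""
    hits, ap = 0, 0.0
    for rank, ans in enumerate(ranked, start=1):
        if ans in relevant:
            hits += 1
            ap += hits / rank
    return ap / max(1, len(relevant))

# One question: the gold answer is ranked second of three candidates.
ranked = ["span_b", "span_gold", "span_c"]
p1 = precision_at_1(ranked, {"span_gold"})     # 0.0
ap = average_precision(ranked, {"span_gold"})  # 1/2 = 0.5
```

With a single relevant answer per question, AP reduces to the reciprocal rank of the gold answer.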
5 Results and Analysis

Effectiveness of Answer Quality Evaluation Methods

Performance of AC on Answer Classification
We train the NLI model for producing AC using the BERT sequence classification implementation from the Huggingface Transformers library (Wolf et al., 2020) on the Primary-C set. It achieves 80.65% accuracy and 87.59% F1 on the Test-C set. On the Test-NC set, it achieves 62.57% accuracy and 64.52% F1, much worse than on the Test-C set. The results indicate that answers to questions which reach consensus in the first sourcing stage are relatively more distinguishable, and show the difficulty of judging the correctness of disagreed-upon answers. Figure 4a and Figure 5 show that AC is an effective metric for distinguishing correct from wrong answers, achieving 0.70 AUC.

Performance of AF on Answer Classification
We train the QA model using the BERT-QA implementation from the Huggingface Transformers library on the Primary-C set and adopt exact match (EM) and F1 score to evaluate its performance. The QA model achieves 27.94% EM and 60.89% F1 on the Test-C set. In contrast, its performance on the Test-NC set is 9.15% EM and 37.22% F1, much worse than on Test-C, demonstrating the difficulty of automatically answering these questions. Although the QA model itself performs poorly due to the lack of sufficient training data, the AF score is an effective metric for correct answer classification, as shown in Figure 4b and Figure 5, achieving 0.71 AUC, slightly better than AC. Combining AC and AF as a simple sum (AC+AF) improves answer classification performance by up to 4%.
Performance of Answer Quality Evaluation on Answer Aggregation In Table 2, the rows AC, AF and AC+AF show the results of answer aggregation when ranking answers by AC, AF, or their combination (AC+AF). AC and AF have comparable performance; both achieve over 57% P@1, an improvement of around 10% over the baselines, which shows the effectiveness of the proposed signals. By combining the NLI model prediction and the QA model heuristic signal, we further improve P@1 by around 3% on both the Primary-NC and Test-NC sets, which shows the complementary strengths of the two signals.

Effectiveness of Answer Text Sequence Aggregation
As shown in Table 2, SMV and SMS achieve performance similar to AC and AF while using the pre-trained BERT-base model as the encoder without any fine-tuning. This supports modeling answer aggregation for the extractive QA task as a sequence aggregation problem. These methods outperform the context-free sequence aggregation baselines by about 10%, which demonstrates the importance of contextual embeddings. Since both SMV and SMS rely on latent semantic similarity among answer candidates, their effectiveness implies that the crowdsourced answers share common knowledge or contextual information that can be further exploited.
We then conduct experiments combining AC and AF with SMS and SMV separately (AC-SMV, AF-SMV, AC-SMS and AF-SMS). Results in Table 2 show that these joint methods achieve around 3% absolute improvement in P@1 and around 5% in R@1 over SMS and SMV alone, and perform similarly to AC+AF (only slightly worse). Combining AC+AF with SMS or SMV (ACAF-SMS / ACAF-SMV) further improves performance by around 2% in P@1 and around 1% on the other metrics. These findings suggest the effectiveness of the joint aggregation method, and demonstrate that better performance can be achieved by combining unsupervised contextual answer representations with the weakly learned signals.

Case Study
As shown in Table 4, we conduct a case study to examine the behavior of the proposed framework. In this case, AC, AC+AF and SMS suggest "waste" as the correct answer, although its answer confidence is very low (0.0025). AF points to "great pacific garbage patch that stretches" as the best answer. Only ACAF-SMS ranks the golden answer "of the pacific ocean" first, even though neither the AC nor the AF score of this answer is the highest.

Conclusion
In this paper, we propose a novel answer annotation aggregation method for EQA crowdsourcing. We show that, without any fine-tuning, our methods can achieve performance comparable to QA and NLI models trained on limited training data. We introduce a novel algorithm that combines the NLI model, the QA model and contextual text embeddings for answer text sequence aggregation. Experiments on a real large-scale crowdsourced EQA dataset show the effectiveness and stability of the proposed method. The proposed methods outperform the baseline single-metric method by around 16% absolute in P@1 and around 10% on other ranking metrics. For future work, we will further explore methods incorporating crowd worker reliability and question answerability for better answer aggregation. We will also explore the applicability of our approaches to other tasks that collect extractive texts (DeYoung et al., 2020; Zhang et al., 2021b).

Table 4: An example from the NewsQA dataset. There are 7 different answer annotations for the question, some of which overlap. For each answer we report its ranking scores under AC, AF, SMS and ACAF-SMS.

Limitations
While many automatic answer aggregation methods take crowd workers' reliability into consideration (Tian and Zhu, 2015; Li and Fukumoto, 2019), to keep the proposed framework simple and concise we focus on the influence of answer quality and ignore worker reliability. Moreover, we mainly use NewsQA to evaluate the proposed method; although we also report experiments on SQuAD and Natural Questions in Appendix A.3, NewsQA is, to the best of our knowledge, the only large extractive QA dataset that provides all actual annotations. Besides, this paper assumes there is only one correct answer per question, while some applications admit multiple correct answers. We will explore crowd worker reliability aware answer aggregation methods and extend our work to multi-answer settings in future research.
Table 7: Performance of answer aggregation on Primary-NC and Test-NC using the BERT-base-uncased model in terms of Exact Match (EM) and F1.

A.3 Answer Aggregation Results on Other Datasets
The SQuAD and Natural Questions datasets only provide multiple annotations for their dev sets. We performed experiments on them by treating the training set as Primary-C and selecting questions with multiple distinct annotations and one consensus answer as Primary-NC. To train the NLI models needed for answer verification, besides the ground truth answers we create negative answers by sampling different word spans with the same named entity type where possible, or otherwise word spans with the most similar part-of-speech (POS) structure.

Figure 3: Number of answer annotations for each question in the four datasets we use.

Figure 4: Cumulative answer correctness(a) and answer confidence(b) distributions on correct answers and incorrect answers.

Figure 5: ROC curves and area under the curve (AUC) of different answer classification methods, including answer correctness (AC), answer confidence (AF) and their combination.
context: The American photographed the remains of albatross chicks that had died from consuming plastic waste found in the surrounding oceans. According to the artist, not a single piece of plastic in any of the photographs was moved, placed or altered in any way. The nesting babies had been fed the plastic by their parents, who collected what looked to them like food to bring back to their young. From cigarette lighters to bottle caps, the plastic is found in what is now known as the great Pacific garbage patch [...]
Q: What did the GOP leaders say?

Table 1: Statistics of the datasets: number of passages |P|; number of answerable questions |Q_A|; number of unanswerable questions |Q_U|; number of correct answers |A_C|; and number of wrong answers |A_W|.

Table 2: Experimental results of baselines and the proposed framework for answer aggregation on the Primary-NC and Test-NC sets using the BERT-base-uncased model.

Table 3: Results of answer aggregation using different encoders.