Can Large Language Models Fix Data Annotation Errors? An Empirical Study Using Debatepedia for Query-Focused Text Summarization



Introduction
Text summarization is a natural language processing technique that involves generating a concise and coherent summary of a longer piece of text while preserving its most important information (Yao et al., 2017). Query-focused text summarization is a specific type of summarization that generates a summary of the given text focused on answering a specific question (Laskar et al., 2020c) or addressing a particular topic, rather than providing a general overview of the text (Baumel et al., 2018; Goodwin et al., 2020; Su et al., 2020; Xu and Lapata, 2021; Laskar et al., 2020a,b, 2022).
One widely used dataset for this task is the Debatepedia dataset, which consists of arguments and counter-arguments on conversational topics (Nema et al., 2017). The query-focused summarization of argumentative text is a challenging task that has gained increasing attention in recent years due to its potential applications in various domains, such as policy-making, journalism, and legal reasoning.
However, it has recently been found that the quality of the Debatepedia dataset, which is widely used for this task, is limited by noise, with many of the queries in this dataset having no relevance to the source document (Laskar et al., 2022). Since Debatepedia is a rich source of argumentative text on controversial topics that can serve as a valuable resource for the development and evaluation of summarization models, in this paper we present a novel methodology to clean the Debatepedia dataset via re-annotation of its queries to make it a useful resource for query-focused abstractive summarization. Our data annotation approach leverages large pre-trained language models (Devlin et al., 2018; Brown et al., 2020; Ouyang et al., 2022), such as ChatGPT (OpenAI, 2023) and PaLM-2 (Anil et al., 2023), that have demonstrated an impressive capability of generating fluent and coherent text (Laskar et al., 2023a). Using these LLMs, we regenerate the queries in the Debatepedia dataset that are most likely to have no relevance to the document and the summary. More specifically, this paper aims to investigate whether LLMs can be utilized to fix the existing issues in the Debatepedia dataset. Our extensive experiments show that utilizing rule-based filtering to eradicate noisy instances, alongside leveraging the generative power of LLMs to regenerate the irrelevant queries, leads to performance improvements in terms of both query relevance and summary generation quality. We will make this LLM-annotated cleaned version of Debatepedia publicly available.
Among the datasets mentioned above, one notable exception is the Debatepedia dataset, since it requires generating summaries from a document containing argumentative text (i.e., arguments and counter-arguments). However, it has recently been found that many samples in the Debatepedia dataset are not actually query-oriented, as models trained without considering query relevance could achieve almost similar performance to the query-focused summarization models (Laskar et al., 2022). Thus, there remains a scarcity of datasets specifically tailored to generating query-focused summaries of argumentative texts.
Though some studies (Abdullah and Chali, 2020) have attempted to generate queries for generic summarization datasets (e.g., CNNDM (Nallapati et al., 2016)), we find that these queries are generated by directly extracting words from the reference summaries, giving the summarization models unexpected access to the keywords in the reference summaries. LLMs have received a lot of attention recently due to their impressive language generation capability, ensuring high fluency, coherence, and grammatical correctness in the generated texts (Laskar et al., 2023a; Qin et al., 2023; Bang et al., 2023; Yang et al., 2023; Wang et al., 2023; Kocoń et al., 2023). More importantly, LLMs like ChatGPT have also demonstrated an impressive capability for data annotation (Wang et al., 2021; Ding et al., 2022; Gilardi et al., 2023). To this end, in this paper, we study how to fix the queries in Debatepedia using LLMs to construct a cleaned version of the dataset, making it suitable for query-focused summarization of argumentative texts.

Our Annotation Methodology
Debatepedia is a publicly available dataset of arguments and counter-arguments on debate topics, proposed by Nema et al. (2017). It contains about 13K query-document-summary pairs. The average number of words per document, summary, and query in the Debatepedia dataset is 66.4, 11.16, and 9.97, respectively. The dataset covers a wide range of topics, such as politics, sports, and technology, and has been extensively used in recent years to build query-based summarization models for argumentative text (Laskar et al., 2022). However, the quality of Debatepedia as a dataset for query-based summarization has many limitations (see Table 5 in Appendix A.1 for some examples), as it has been found recently that many queries in this dataset are not relevant to the document (Laskar et al., 2022). To address these limitations, we propose a methodology for cleaning the Debatepedia dataset by leveraging two popular LLMs, ChatGPT (OpenAI, 2023) and PaLM-2 (Anil et al., 2023), as annotators. In this regard, we initially explored various techniques to identify how to effectively sample the noisy instances, and subsequently, we regenerated the queries for the sampled instances. We denote our ChatGPT- and PaLM-annotated versions of Debatepedia (DP) for Query-Focused Abstractive Summarization as CQSumDP and PQSumDP, respectively.

Data Sampling
We explore two approaches for data sampling. In one approach, we study whether only fixing the queries in the Debatepedia dataset via leveraging LLMs for query regeneration can address the issues in the dataset. For this purpose, we ask the LLMs to identify the instances in the Debatepedia dataset where the queries seem irrelevant. In our other approach, we first sample data instances based on some filtering rules, excluding instances that are less relevant for query-focused summarization, and then we ask the LLMs to regenerate the queries for those sampled instances where the queries look irrelevant. Our prompt for data sampling using LLMs is shown in Table 1(a). Below, we describe these approaches.
(i) LLM-based Data Sampling without Filtering: In this approach, we use the full Debatepedia dataset to find the irrelevant queries using LLMs. For this purpose, we provide each instance of Debatepedia to the LLMs to determine whether the query is relevant to the document/summary. However, we find a significant difference between the LLMs in this task. While PaLM-2 identifies only 659 queries as irrelevant (612/19/28 in the train/valid/test sets, respectively), ChatGPT identifies 6435 queries as irrelevant (5697/316/422 in the train/valid/test sets, respectively), out of 13719 samples.

(a) Below, we provide a query, a document, and the query-focused summary of the given document. Identify whether the query is relevant to the summary? Answer as either yes or no.

Query: [QUERY] Document: [DOCUMENT] Summary: [SUMMARY]

(b) A document along with its summary are given below. Write down the most reasonable query relevant to this document-summary pair? Document: [DOCUMENT] Summary: [SUMMARY]

Table 1: Prompts for LLMs: (a) data sampling for query regeneration, and (b) regenerating the sampled queries.
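The sampling step can be sketched as a thin wrapper that fills the Table 1(a) template and maps the model's yes/no reply to a keep/regenerate decision. This is a minimal sketch: the function names are ours, and the actual API call to ChatGPT or PaLM-2 is deliberately omitted.

```python
# Template following the Table 1(a) data-sampling prompt.
SAMPLING_PROMPT = (
    "Below, we provide a query, a document, and the query-focused summary "
    "of the given document. Identify whether the query is relevant to the "
    "summary? Answer as either yes or no.\n\n"
    "Query: {query} Document: {document} Summary: {summary}"
)


def build_sampling_prompt(query: str, document: str, summary: str) -> str:
    """Fill the Table 1(a) template for one Debatepedia instance."""
    return SAMPLING_PROMPT.format(query=query, document=document, summary=summary)


def is_query_irrelevant(llm_reply: str) -> bool:
    """Map the LLM's yes/no reply to a boolean (True means: regenerate this query)."""
    return llm_reply.strip().lower().startswith("no")
```

In use, `build_sampling_prompt` would be sent to the LLM API for every instance, and instances for which `is_query_irrelevant` returns True would be routed to the query-regeneration step.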
(ii) LLM-based Data Sampling via Filtering: In this approach, instead of cleaning the Debatepedia dataset by only fixing the queries, we also follow some rules to first filter out irrelevant instances from the dataset to address its existing limitations (Laskar et al., 2022), such as smaller documents, close-ended questions, etc. Since for the smaller documents the reference summaries are mainly generic summaries of the document where the additional query does not help, we exclude smaller documents to ensure that the reference summaries are more query-focused. This also helps us address the noisy scenario in the dataset where the reference summary is longer than the document. Based on manual analysis, we find that a minimum length of 75 words for each selected document ensures a document where the query can play a role in the summary generation. To also address the issue of short summaries that look like answers to close-ended questions, we exclude instances where the summary is shorter than 5 words. This helps us clean the dataset such that, instead of a dataset with close-ended questions and short answers, we propose a dataset consisting of concise but coherent summaries. This results in a filtered version of the dataset that is smaller in size, consisting of 5291/309/405 instances, in comparison to the original dataset containing 12000/719/1000 instances, in the train/valid/test sets, respectively. We also find that ChatGPT and PaLM-2 identified 2171/120/145 and 218/6/6 queries as irrelevant in the training, validation, and test sets, respectively.
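The two filtering rules (documents of at least 75 words, summaries of at least 5 words) can be sketched as a simple predicate over whitespace-tokenized lengths. The thresholds are the ones stated above; the function name is our own.

```python
MIN_DOC_WORDS = 75      # shorter documents tend to yield generic, non-query-focused summaries
MIN_SUMMARY_WORDS = 5   # shorter summaries look like answers to close-ended questions


def keep_instance(document: str, summary: str) -> bool:
    """Rule-based filter: keep only instances where the query can matter."""
    doc_len = len(document.split())
    sum_len = len(summary.split())
    return doc_len >= MIN_DOC_WORDS and sum_len >= MIN_SUMMARY_WORDS
```

Because the document must contain at least 75 words and the summary at most needs 5, this predicate also rules out most of the noisy cases where the reference summary is longer than the document itself.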
Below, we demonstrate how we utilize LLMs for query regeneration.

Using LLM for Query Regeneration
We concatenate the document and the reference summary together and give the result as input to the LLMs for query regeneration. Our sample prompt for this task can be found in Table 1(b). While we could ask the LLMs to generate both the query and the query-based summary by giving only the document in the input prompt, we did not do so because LLMs like ChatGPT tend to generate longer summaries (Laskar et al., 2023a; Qin et al., 2023), and the resulting dataset would become fully synthetic. Thus, we use both the document and the summary as input and only regenerate the queries while keeping the original reference summaries intact. We find that the queries regenerated using ChatGPT and PaLM-2 have only 15.2% and 11.4% word overlap, respectively, with the gold summaries, in comparison to the 10.6% word overlap in the original Debatepedia dataset.
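The word-overlap figures above can be approximated with a simple set-intersection measure. The exact formulation is not specified in the text, so the following is our assumption: the fraction of unique (lowercased) query words that also occur in the gold summary.

```python
def query_summary_overlap(query: str, summary: str) -> float:
    """Fraction of unique (lowercased) query words that also occur in the summary.

    Illustrative measure only; the paper does not specify its exact overlap formula.
    """
    q_words = set(query.lower().split())
    s_words = set(summary.lower().split())
    if not q_words:
        return 0.0
    return len(q_words & s_words) / len(q_words)
```

Averaging this score over all instances for the original, ChatGPT-regenerated, and PaLM-2-regenerated queries would give comparable corpus-level overlap percentages.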

Experimental Results
In this section, we present our experimental findings. We denote the version of our dataset where we did not apply any filtering as the unfiltered version, whereas we denote the version where we also applied rule-based filtering while sampling data instances as the filtered version. For ChatGPT, we use the gpt-3.5-turbo-0301 model, while for PaLM-2, we use the text-bison@001 model. We fine-tune the following models to benchmark performance on our re-annotated versions of Debatepedia, since these models have achieved impressive performance in query-focused abstractive summarization: BART (Lewis et al., 2019), T5 (Raffel et al., 2019), and Pegasus (Zhang et al., 2019) (see Appendix A.3.1).

Effectiveness of LLMs for Data Cleaning
In this section, to investigate the effectiveness of using LLMs for data cleaning, we evaluate the performance of models trained on different LLM-annotated versions of the Debatepedia dataset on an out-of-domain dataset for the query-focused abstractive summarization task. This is done to ensure that all models are evaluated on the same evaluation set. In this regard, we use the development set of the QA-NLG version of the MS-MARCO (Wang et al., 2018) dataset (12467 samples). We follow settings similar to Laskar et al. (2022) by only considering the gold passage as the source document, and after combining the passage with the query, we give the concatenated text as input to the models. The results of all three models (BART, T5, Pegasus) on MS-MARCO, fine-tuned on the respective versions of Debatepedia, are shown in Table 2. Based on our experimental results, we observe that the domain generalization performance is much better when the CQSumDP/PQSumDP versions of the Debatepedia dataset are used, in comparison to using the original Debatepedia dataset. Comparing ChatGPT and PaLM as data annotators, we observe that models trained on CQSumDP perform better than those trained on PQSumDP. Moreover, we find that models trained on the filtered version obtain better performance (with T5-Base achieving the best result), indicating the importance of cleaning the Debatepedia dataset by excluding noisy instances, alongside utilizing LLM-generated queries.
Qualitative Evaluation of Model Generated Summaries: We sample 10 summaries generated by each model (BART, T5, Pegasus) on the MS-MARCO dataset to conduct human evaluations for our best-performing approach, CQSumDP (filtered version), and the baseline, the original Debatepedia. In our human evaluation, we ask humans to assign a score from 1 to 5 for the factual consistency and the coherence of the summaries generated by different models for the given queries. The average coherence and factual consistency scores for models trained on CQSumDP (filtered) are 3.4 and 3.3, respectively, in comparison to 3 and 2.6, respectively, for the original Debatepedia. This further establishes the effectiveness of using LLMs as annotators to construct a more suitable dataset for query-focused text summarization.
Qualitative Evaluation of LLM Generated Queries: We sample 100 instances and ask three human evaluators to choose the query they prefer, between the ChatGPT- and PaLM-generated queries (see Appendix A.4 for some examples), based on the conciseness and the relevancy of the query. We find that in 66% of cases (via majority voting), the ChatGPT-generated queries were preferred.
LLM as Query Relevancy Classifier: To measure the capability of LLMs in classifying whether the query is relevant to the document/summary, we sample 100 instances and ask three human evaluators whether they agree with the classification made by the LLMs. Based on majority voting, we find that the precision for this classification task is 75% for PaLM-2 and 63% for ChatGPT. This trend of PaLM-2 outperforming ChatGPT in discriminative tasks (e.g., classification) while underperforming in generative tasks has also been observed in recent studies (Jahan et al., 2023).
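The majority-voting precision reported above can be computed as sketched below. The data layout and names are illustrative assumptions; only the voting scheme (three evaluators, majority vote) comes from the text.

```python
from collections import Counter


def majority_vote(votes):
    """Return the label chosen by most evaluators."""
    return Counter(votes).most_common(1)[0][0]


def classifier_precision(flagged_indices, evaluator_votes):
    """Precision of the LLM's relevance classifications against majority human judgment.

    flagged_indices: indices of instances classified by the LLM.
    evaluator_votes: dict mapping index -> list of 'agree'/'disagree' human votes.
    """
    agreed = sum(
        1 for i in flagged_indices
        if majority_vote(evaluator_votes[i]) == "agree"
    )
    return agreed / len(flagged_indices)
```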
Ablation Studies: To further investigate the usefulness of LLM-generated queries, we conduct the following ablation tests using the best-performing model, T5-base, on MS-MARCO (see Table 4).
(i) Remove LLM-generated Query: Here, we evaluate the performance of the T5 model by fine-tuning it on the filtered version of Debatepedia without incorporating any query relevance. Based on the average score across different metrics, we find that the performance of T5 drops by 9.53% on average, in comparison to the T5 model fine-tuned on the CQSumDP (filtered) dataset.
(ii) Replace LLM-generated Query: Here, we evaluate the performance by fine-tuning T5 using the original query instead of the LLM-generated query in the filtered version of Debatepedia. Based on the average scores achieved by the T5 model, the performance drops by 3.57% on average, compared to T5 fine-tuned on CQSumDP (filtered).

Cost and Time Efficiency: Recently, it was found that LLMs could significantly reduce labeling cost without sacrificing model performance much, making it possible to train models on larger datasets without the need for human labeling (Wang et al., 2021; Ding et al., 2022; Liu et al., 2023). In this work, we observe that the ChatGPT/PaLM APIs could generate about 15 queries per minute on average, which should be much faster than using human annotators, since humans may need some time to come up with the most effective query for a given document-summary pair. This makes LLMs more efficient for annotation.

Performance Benchmarking on Different Versions of Debatepedia
In this section, we benchmark the performance of models on various LLM-annotated versions of the Debatepedia dataset. We present our results in Table 3 and find that all three models perform better on the CQSumDP dataset in comparison to their performance on PQSumDP. This gives a further indication that the queries generated by ChatGPT are more helpful in improving model performance. Comparing the different models, we find that in both the filtered and the unfiltered versions, the best performance is achieved by the BART model.

Conclusions and Future Work
In this paper, we study how to effectively leverage LLMs to construct a cleaned version of the Debatepedia dataset, addressing its existing limitations in order to make it suitable for query-focused text summarization. Based on extensive experiments and evaluation, we demonstrate that our proposed data re-annotation approach using LLMs (especially ChatGPT) results in a cleaner version of Debatepedia that is more effective for the query-focused summarization task than the original dataset. In the future, we will explore whether few-shot examples with LLMs lead to better performance. Our re-annotated versions of Debatepedia will also be made publicly available here: https://github.com/tahmedge/CQSUMDP.

Limitations
The ChatGPT (GPT-3.5) and PaLM models are continuously upgraded by OpenAI and Google. Thus, it may not be possible to reproduce the same queries using these models. However, this also mimics the real-world scenario where different human annotators may write different queries (e.g., in many text summarization datasets, there can be multiple gold reference summaries written by different human annotators). Similar to the work of Guo et al. (2023), we notice that this difference is very small. Therefore, we generate only one query for each example. Though a newer version of ChatGPT, GPT-4, has recently been released and may generate better queries, in this work we did not utilize GPT-4, as it is considerably more expensive to use than the original ChatGPT (i.e., GPT-3.5) while being significantly slower. Nonetheless, future work may compare other, more powerful LLMs (including GPT-4) for data annotation.

Ethics Statement
Since this paper only utilizes LLMs to generate queries for the given document-summary pairs, it does not lead to any unwanted biases or ethical concerns. Nevertheless, all the responses generated by ChatGPT and PaLM were manually checked by the authors to ensure that the LLM-generated queries in the cleaned version of the dataset do not pose any ethical concerns or unwanted biases. Only a publicly available academic dataset that did not require any licensing was used. Thus, no personally identifiable information was used while utilizing LLMs to fix the queries in the Debatepedia dataset. All the human evaluators were paid above the minimum wage.

A.1 Debatepedia Dataset Limitations
Based on 100 randomly sampled instances, it has been found in recent studies (Laskar et al., 2022, 2023b) that:

• 52% of the queries in this dataset have no relevance to the documents or the summaries, as demonstrated in Table 5.
• Though many queries in this dataset are relevant to the documents, the summaries are mostly generic due to the short document length. The average length of a document in this dataset is only 66.4 words.
In addition, many instances in this dataset contain only a one-word summary for a given query (see Example 2 in Table 5), appearing in both the training and evaluation sets, which may help the model memorize such words for similar queries during the training phase. These issues may lead to an unexpected increase in the ROUGE score when the model starts reproducing those words in the summary during the evaluation phase. Furthermore, we also find some instances where the summary is longer than the document, which usually happens with short documents (see Example 3 in Table 5).

A.2 Example Prompt for Query Generation
One example prompt to re-generate the query using LLMs is shown in Figure 1.

A.3.1 Models
To evaluate the effectiveness of our ChatGPT-annotated CQSumDP and PaLM-annotated PQSumDP datasets, we fine-tune some state-of-the-art pre-trained sequence-to-sequence models (Lewis et al., 2019; Raffel et al., 2019; Zhang et al., 2019; Goodwin et al., 2020). For this purpose, we concatenate the query with the document and give the result as input to these models to generate query-focused abstractive summaries, as this approach has recently shown impressive performance on the query-focused abstractive summarization task (Laskar et al., 2022). We describe these models below.

The following examples belong to Table 5 in Appendix A.1 and illustrate the noise in the original dataset (the noisy text is reproduced verbatim):

Example 1: Query having no relevance with the document and the summary.
Query: Does an MBA enhance leadership skills?
Document: Business schools might improve your quantitative presentation and communication skills. It might but get you thinking about ethical and strategy. But two years of case studies aren't go to turn you into a leader if you weren't died one. There's no learning charisma persuasiveness elegance or gut instinct.
Reference Summary: PhD will not improve cm factors of leaders.

Example 2: One-word summary having no relevance with the query or document.
Query: Education: do child benefit from watching tv?
Document: by watching news child can learn about geography politics advances in science - everything simply and later explained. child learn about real-life situation that happens on everyday basis which will benefit them in the future.
Reference Summary: News.

Example 3: The length of the summary is longer than the document, with the query being irrelevant.
Query: activists: where do the keys activists and organizations stand?
Document: see an analyses of the article ...
Reference Summary: philip martin of berkeley davis and michael teitelbaum the mirage of mexican guest workers nov/dec # foreign affairs.

Example 4: More of a close-ended question.
Query: friendships: does twitter harms relationships?
Document: twitter helps those stay in touches no matter how far they may be from each other.
Reference Summary: long-distance friendships.

BART (Bidirectional and Auto-Regressive Transformer): BART (Lewis et al., 2019) is a pre-trained sequence-to-sequence model based on the encoder-decoder architecture that was pre-trained on a large amount of diverse text data using a denoising auto-encoding technique to recover the original form of a corrupted document. The pre-training involved various objectives such as rotating the document, permuting sentences, infilling text, masking tokens, and deleting tokens. We use the pre-trained BART model since fine-tuning this model was found to be very effective for abstractive summarization (Laskar et al., 2022).

T5 (Text-to-Text Transfer Transformer):
The T5 model (Raffel et al., 2019) is a transformer-based encoder-decoder model. Unlike traditional BERT-style models that classify input text into a specific category, the T5 model treats all tasks, such as text classification, question answering, neural machine translation, and text summarization, as sequence-to-sequence problems, using various pre-training objectives. After pre-training, the model is fine-tuned on many downstream tasks, achieving impressive performance across various datasets, including summarization.
Pegasus (Pre-training with Extracted Gap-sentences for Abstractive Summarization): Pegasus (Zhang et al., 2019) is a transformer-based pre-trained encoder-decoder model for abstractive summarization. Its pre-training objective involves generating summary-like text from an input document. To achieve this, the Pegasus model first selects and masks some sentences from the input document(s). It then concatenates these selected sentences to create a pseudo-summary. The model uses different approaches to select these sentences, such as randomly selecting a certain number of sentences, selecting the first few sentences, or computing the ROUGE-1 score between each sentence and the rest of the document to choose the top-scoring sentences. This pseudo-summary is then used for self-supervised learning. By pre-training on large datasets using this approach, the model achieves impressive fine-tuning performance on downstream summarization datasets.

A.3.2 Implementation
We use the HuggingFace (Wolf et al., 2019) library to implement the baseline models for performance evaluation. Similar to prior work, we concatenated the query with the document to give as input to the pre-trained baselines (i.e., BART, Pegasus, T5). The pre-trained models were then fine-tuned using 4 NVIDIA V100 GPUs. The training batch size for BART was set to 16, while it was set to 4 for Pegasus and T5. The other hyperparameters were the same for all models, with the learning rate set to 2e-3 and the maximum input (i.e., the concatenated query and document) sequence length set to 150 tokens. The minimum and maximum target (i.e., generated summary) sequence lengths were 5 and 25 tokens, respectively. A total of 10 epochs were run to fine-tune the pre-trained summarization models. We computed the ROUGE (Lin, 2004) scores in terms of ROUGE-1, ROUGE-2, and ROUGE-L using the Evaluate library to compare the performance of different models on the respective test sets. As noted earlier, for ChatGPT, we use the gpt-3.5-turbo-0301 model, while for PaLM, we use the text-bison@001 model.
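ROUGE-1 F1 between a generated summary and a reference reduces to unigram overlap. The sketch below is a from-scratch approximation: the paper uses the Evaluate library, which additionally applies stemming and other normalization, so scores will differ slightly from the reported ones.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap ROUGE-1 F1 (no stemming or stopword handling)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each candidate unigram counts at most as often as in the reference.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

ROUGE-2 follows the same pattern over bigrams, and ROUGE-L replaces the overlap count with the longest common subsequence length.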

A.4 Qualitative Analysis of the Annotated Data
In this section, we perform some qualitative analyses of the queries in the original Debatepedia dataset as well as the queries generated using LLMs in our proposed CQSumDP and PQSumDP versions of the dataset. For our analysis, we collect 3 samples from this dataset and present them in Table 6. Comparing the queries in the first example in the table, we find that the original query is just one word long and very ambiguous, while the ChatGPT-generated query is more descriptive and more relevant to both the document and the summary. For the second example, we find that even though the original query is descriptive, it does not have any relevance to the generated summary, whereas both the ChatGPT- and PaLM-generated queries are very relevant to both the document and the summary (in this example, the PaLM-generated query is more descriptive).
For the third example, we find that the original query is related to "entrepreneurs". However, the document is about "product managers", not "entrepreneurs". Meanwhile, the ChatGPT- and PaLM-generated queries are both very relevant to the document, and the two LLM-generated queries are identical. This analysis further demonstrates the relevance of our LLM-generated queries in comparison to the original queries in Debatepedia.

Figure 1: Example Input to LLMs for Query Generation.

Table 2: Performance of different models on MS-MARCO when trained on respective versions of Debatepedia (DP).

Table 3: Performance of different models on various versions of Debatepedia.

Table 4: Ablation test results for the T5-Base model fine-tuned on Debatepedia (filtered) and evaluated on MS-MARCO.

Table 5: Some examples demonstrating the limitations in the Debatepedia dataset.