Claim Optimization in Computational Argumentation

An optimal delivery of arguments is key to persuasion in any debate, both for humans and for AI systems. This requires the use of clear and fluent claims relevant to the given debate. Prior work has studied the automatic assessment of argument quality extensively. Yet, no approach so far actually improves that quality. To fill this gap, this paper proposes the task of claim optimization: rewriting argumentative claims in order to optimize their delivery. As multiple types of optimization are possible, we approach this task by first generating a diverse set of candidate claims using a large language model, such as BART, taking into account contextual information. The best candidate is then selected using various quality metrics. In automatic and human evaluation on an English-language corpus, our quality-based candidate selection outperforms several baselines, improving 60% of all claims (and worsening only 16%). Follow-up analyses reveal that, beyond copy editing, our approach often specifies claims with details, whereas it adds less evidence than humans do. Moreover, its capabilities generalize well to other domains, such as instructional texts.


Introduction
The delivery of arguments in clear and appropriate language is a decisive factor in achieving persuasion in any debating situation, known as elocutio in Aristotle's rhetoric (El Baff et al., 2019). Accordingly, the claims composed in an argument should not only be grammatically fluent and relevant to the given debate topic, but also unambiguous, self-contained, and more. Written arguments therefore often undergo multiple revisions in which various aspects are optimized (Zhang and Litman, 2015).
As detailed in Section 2, extensive research has been done on the automatic assessment of argument quality and the use of large language models on various text editing tasks. Yet, no work so far has studied how to actually improve argumentative texts. However, developing respective approaches is a critical step towards building effective writing assistants, which could not only help learners write better argumentative texts (Wambsganss et al., 2021), but also rephrase arguments made by an AI debater (Slonim et al., 2021).

Figure 1: Examples of different optimized versions of an original claim found on the debate platform Kialo, in a debate on the topic "Humans should be allowed to explore DIY gene editing." Original claim: "This technology could be weaponized." Optimized versions: (1) "This technology could be weaponized and harmful to human beings." (2) "This technology could be used by criminals to create and weaponize bio-mechanisms." (3) "This technology could be weaponized, so it is important to safeguard it from being weaponized." All optimizations were generated by the approach proposed in this paper, using the debate topic as context.

In this work, we close the outlined gap by studying how to employ large language models for rewriting argumentative text in order to optimize its delivery. We start by defining the new task of claim optimization in Section 3, and adjust the English claim revision dataset of Skitalinskaya et al. (2021) for evaluation. This task requires complementary abilities. On the one hand, different types of quality issues inside a claim must be detected, from grammatical errors to missing details. If not all quality aspects can be improved simultaneously, specific ones must be targeted. On the other hand, improved claim parts need to be integrated with the context of the surrounding discussion, while preserving the original meaning as far as possible.

Figure 1 shows three exemplary optimizations of a claim from the debate platform Kialo. The first elaborates what the consequence of weaponization is, whereas the second rephrases the claim to clarify what weaponizing means, employing knowledge about the debate topic. The third renders the stance of the claim explicit. We observe that different ways to optimize a claim exist, yet the level of improvement differs as well.
As an initial approach to claim optimization, we propose to combine the capabilities of large language models with quality assessment in a controlled generation process (Section 4). First, a fine-tuned sequence-to-sequence model produces several candidate optimizations of a given claim. To optimize claims, we condition the model on discourse context, namely the debate topic and the previous claim in the debate. The key to finding the best candidate is to then rerank the candidates with respect to three complementary quality metrics: grammatical fluency, meaning preservation, and argument quality. Such reranking remains understudied in generative tasks within computational argumentation.
In automatic and manual evaluation (Section 5), we demonstrate the effectiveness of our approach, employing fine-tuned BART (Lewis et al., 2020) for candidate generation. Our results stress the benefits of quality assessment (Section 6). Incorporating context turns out especially helpful for making shorter claims more self-contained, since for these the topic of the debate is difficult to infer. According to human annotators, our approach improves 60% of all claims and harms only 16%, clearly outperforming generation without reranking.
To gain further insights, we carry out a manual annotation of 600 claim optimizations and identify eight types typically found in online debate communities, such as elaboration and disambiguation (Section 7). Intriguingly, our approach covers a variety of optimization types similar to human revisions, but we also observe limitations (Section 7).
To explore to what extent our approach generalizes to other domains, we also carry out experiments on instructional texts (Anthonio and Roth, 2020) and formal texts (Du et al., 2022), finding that it outperforms strong baselines and state-of-the-art approaches.
In summary, the contributions of this paper are:
1. a new task, claim optimization, along with a manual analysis of typical optimization types;
2. a computational approach that reranks generated candidate claims with respect to quality;
3. empirical insights into the impact and challenges of optimizing claims computationally.

Data, code, and models are available at https://github.com/GabriellaSky/claim_optimization

Related Work
Wikipedia-based corpora have often been used in the study of editing and rewriting, including paraphrasing (Max and Wisniewski, 2010), sentence simplification (Botha et al., 2018), grammatical error correction (Lichtarge et al., 2019), bias neutralization (Pryzant et al., 2020), and controllable text editing (Faltings et al., 2021; Du et al., 2022). Similarly, WikiHow has served for summarization (Koupaee and Wang, 2018) and knowledge acquisition (Zhou et al., 2019). However, none of these sources includes argumentative texts. Instead, we rely on data from Skitalinskaya et al. (2021), which consists of revision histories of argumentative claims from online debates. Whereas the authors compare claims in terms of quality, we propose and study the new task of automatically optimizing claim quality.
The key idea of our approach is to rerank multiple candidates generated by a language model. Prior work on reranking in generation hints at the potential benefits of such a setup, albeit in different tasks and domains. In early work on rule-based conversational systems, Walker et al. (2001) introduced novel dialogue quality metrics to optimize template-based systems towards user satisfaction. Kondadadi et al. (2013) and Cao et al. (2018) ranked templates for text generation, and Mizumoto and Matsumoto (2016) used syntactic features to rerank candidates in grammatical error correction. More recently, Yoshimura et al. (2020) proposed a reference-less metric trained on manual evaluations of grammatical error correction system outputs to assess generated candidates, while Suzgun et al. (2022) utilized pre-trained general-purpose language models to rerank candidates in textual style transfer tasks. However, reranking is still largely understudied in generation research within computational argumentation. The most related approach, that of Chakrabarty et al. (2021), reframes arguments to be more trustworthy (e.g., less partisan). It generates multiple candidates and reranks them based on entailment relation scores to the original text. Building on this, we rerank candidates based on various properties, including argument quality.
Understanding the editing process of arguments is crucial, as it reveals what quality dimensions are considered important. For Wikipedia, Daxenberger and Gurevych (2013) proposed a fine-grained taxonomy as a result of their multi-label edit categorization of revisions (Daxenberger and Gurevych, 2012). The taxonomy focuses solely on the editing actions performed, such as inserting, deleting, and paraphrasing. In contrast, Yang et al. (2017) identified various semantic intentions behind Wikipedia revisions, from copy editing to content clarifications and fact updates. Their taxonomy defines a starting point for our research. Not all covered intentions generalize beyond Wiki scenarios, though.
For the analysis of argumentative text rewriting, Zhang and Litman (2015) incorporated both argumentative writing features and surface changes. To explore the classification of essay revisions, they defined a two-dimensional schema, combining the revision operation (e.g., modify, add, or delete) with the component being revised (e.g., reasoning or evidence). Moreover, Afrin and Litman (2018) created a small corpus of between-draft revisions of 60 student essays to study whether revision improves quality. However, these works do not uncover the reasoning behind a revision operation and are more geared towards analysis at the essay level.
The corpus we use distinguishes three claim revision types: clarification, grammar correction, and linking to external resources (Skitalinskaya et al., 2021). However, we argue that this is too coarse-grained to represent the diversity of claim quality optimization. For refinement, we manually identify eight types of optimizations, allowing for a systematic analysis of claims improved automatically. The authors also compare the revision types to the 15 dimensions in the argument quality taxonomy of Wachsmuth et al. (2017). Many correlations were rather low, suggesting that the claim revision types are rather complementary to the dimensions. Primarily, they target the general form a well-phrased claim should have and its relevance to the debate.

Task and Data
This section introduces the proposed task and presents the data used for development and evaluation.

Claim Optimization
We define the task of computational claim optimization as follows:

Task. Given as input an argumentative claim c, potentially along with context information on the debate, rewrite c into an output claim c̃ such that (a) c̃ improves upon c in terms of text quality and/or argument quality, and (b) c̃ preserves the meaning of c as far as possible.
While we conceptually assume that c is phrased in one or more complete sentences and that it has at least one quality flaw, the approaches studied later on do not model this explicitly. Moreover, note that a claim may be flawed in multiple ways, often resulting in n ≥ 2 candidate optimizations C̃ = {c̃₁, …, c̃ₙ}. In this case, the goal is to identify the candidate c̃* ∈ C̃ that maximizes overall quality.
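In other words, the selection step can be phrased as an argmax over the candidate set (a notational sketch; the concrete scoring function Q is instantiated as AutoScore in Section 4):

```latex
\tilde{c}^{\,*} \;=\; \operatorname*{arg\,max}_{\tilde{c} \in \tilde{C}} \; Q(\tilde{c} \mid c)
```

where Q(c̃ | c) scores the overall quality of a candidate c̃ given the original claim c.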

Data for Development and Evaluation
As a basis for the development and evaluation of approaches to the task, we build on the ClaimRev dataset of Skitalinskaya et al. (2021), consisting of 124,312 claims and their revision histories from the online debate platform Kialo. Each history defines a chain (c₁, …, cₘ) in which each claim cᵢ, 1 < i ≤ m, is a revised version of the previous claim cᵢ₋₁ that improves it in terms of quality; according to the authors, this holds in 93% of all cases.
From each revision chain, we derived all possible optimization pairs (c, c̃) := (cᵢ₋₁, cᵢ), 210,222 in total. Most revisions are labeled with their intention by the users who performed them, rendering them suitable for learning to optimize claims automatically. Overall, 95% of all pairs refer to three intention labels: clarification, typo/grammar correction, and corrected/added links. To avoid noise from the few remaining labels, we condensed the data to the 198,089 instances of the three main labels. For the final task dataset, we associated each remaining pair (c, c̃) with its context: the debate topic τ (i.e., the thesis on Kialo) as well as the previous claim ĉ (the parent on Kialo), which is supported or opposed by c (see Figure 1). We sampled 600 revision pairs pseudo-randomly as a test set (200 per intention label), and split all other pairs into a training set (90%) and a validation set (10%). As the given labels are rather coarse-grained, we look into the optimizations in more detail in Section 7.
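A minimal sketch of this preprocessing, assuming each revision chain is available as a list of records with hypothetical fields text, label, topic, and parent:

```python
import random

KEPT_LABELS = {"clarification", "typo/grammar correction", "corrected/added links"}

def derive_pairs(chains):
    """Turn revision chains (c_1, ..., c_m) into optimization pairs (c, c~).

    Each chain is assumed to be a list of dicts with hypothetical fields:
    'text' (the claim), 'label' (revision intention), 'topic' (debate
    thesis), and 'parent' (the previous claim in the debate).
    """
    pairs = []
    for chain in chains:
        for prev, curr in zip(chain, chain[1:]):
            if curr["label"] not in KEPT_LABELS:
                continue  # drop the noisy, rarely used intention labels
            pairs.append({
                "source": prev["text"],    # original claim c
                "target": curr["text"],    # revised claim c~
                "label": curr["label"],
                "topic": curr["topic"],    # context: debate thesis
                "parent": curr["parent"],  # context: previous claim
            })
    return pairs

def split_pairs(pairs, seed=42, dev_ratio=0.1):
    """Pseudo-random train/validation split (the 600-pair test set is
    sampled separately, 200 per intention label)."""
    random.Random(seed).shuffle(pairs)
    n_dev = int(len(pairs) * dev_ratio)
    return pairs[n_dev:], pairs[:n_dev]
```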

Approach
We now present the first approach to automatic claim optimization. First, candidate claims are generated that are pertinent to the context given and do not change the meaning of the original claim. Then, the candidates are reranked to find the optimal claim in terms of text and argument quality. Both steps are detailed below and illustrated in Figure 2.

Seq2Seq-based Candidate Generation
To generate candidates, we fine-tune a sequence-to-sequence model on training pairs (c, c̃), treating the original claim c as the encoder source and the revised claim c̃ as the decoder target. In a separate experiment, we condition the models on context information during fine-tuning to further optimize the relevance of the generated candidates. As context, the debate topic τ and the previous claim ĉ are prepended to c, separated by delimiter tokens (Keskar et al., 2019; Schiller et al., 2021), as illustrated below.
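As an illustration, the conditioned encoder input can be assembled as follows (the delimiter token strings are our assumption; any reserved tokens added to the model vocabulary would do):

```python
# Hypothetical delimiter tokens separating context from the claim.
TOPIC_SEP, PARENT_SEP = "<topic>", "<parent>"

def build_source(claim, topic=None, parent=None):
    """Prepend the debate topic and/or the previous (parent) claim to the
    original claim, separated by delimiter tokens."""
    parts = []
    if topic is not None:
        parts += [TOPIC_SEP, topic]
    if parent is not None:
        parts += [PARENT_SEP, parent]
    parts.append(claim)
    return " ".join(parts)

# Example: topic-conditioned input for the claim from Figure 1.
source = build_source(
    "This technology could be weaponized.",
    topic="Humans should be allowed to explore DIY gene editing.",
)
```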
There may be multiple ways to improve c, especially when it suffers from multiple flaws, since not all flaws may be fixed in a single revision. To account for this, we first generate n suitable candidates c̃₁, …, c̃ₙ, among which the optimal one is to be found later (n is set to 10 in Section 5). However, the top candidates created by language models often tend to be very similar. To increase variety, we perform top-k sampling (Fan et al., 2018), where we first generate the most probable candidate (top-1) and then vary k in steps of 5 (e.g., top-5, top-10, etc.), as sketched below.
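The sampling scheme can be sketched with the Hugging Face transformers API as follows (our experiments use fairseq, see Appendix A.1; this is an equivalent illustration, and the checkpoint name stands in for the fine-tuned model):

```python
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

def generate_candidates(source, n=10):
    """Generate n diverse candidates: the most probable sequence (top-1),
    then top-k samples with k increasing in steps of 5 (top-5, top-10, ...)."""
    inputs = tokenizer(source, return_tensors="pt", truncation=True)
    candidates = []
    # Most probable candidate (top-1).
    out = model.generate(**inputs, num_beams=4, max_length=256)
    candidates.append(tokenizer.decode(out[0], skip_special_tokens=True))
    # Top-k sampled candidates with growing k.
    for k in range(5, 5 * n, 5):
        out = model.generate(**inputs, do_sample=True, top_k=k,
                             temperature=0.7, max_length=256, min_length=7,
                             no_repeat_ngram_size=3)
        candidates.append(tokenizer.decode(out[0], skip_special_tokens=True))
        if len(candidates) == n:
            break
    return candidates
```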

Quality-based Candidate Reranking
Among the n candidates, we aim to find the optimal claim c̃* that most improves the delivery of c in terms of text and argument quality. Similar to Yoshimura et al. (2020), we tackle this task as a reranking problem. In our reranking strategy, AutoScore, we integrate three metrics: (1) grammatical fluency, (2) meaning preservation, and (3) argument quality. This way, we can explicitly favor specific quality dimensions via respective models.

Figure 2: Proposed claim optimization approach: First, a sequence-to-sequence model generates n candidates from the original claim, possibly conditioned on context information. Then, the candidates are reranked with respect to three quality metrics. The top-ranked one is used as the optimized claim.

Grammatical Fluency We learn to assess fluency on the MSR corpus of abstractive compressions (Toutanova et al., 2016). The grammaticality of each compression was scored by 3-5 annotators as 1 (major errors, disfluent), 2 (minor errors), or 3 (fluent). We chose this corpus, since the multiple compressions per input make a trained model sensitive to the differences between variations of the same text. For training, we average all annotator scores and transform the task into a binary one, where a compression is seen as disfluent unless all annotators gave the score 3. Then, we train BERT on the binary data to obtain fluency probabilities; the trained model reaches an accuracy of 77.4 (details in Appendix A.2).
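A sketch of the two steps, binarizing the MSR annotations and scoring fluency with the fine-tuned classifier (the checkpoint path is a placeholder):

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

def binarize_fluency(annotator_scores):
    """Label a compression as fluent (1) only if all annotators gave the
    maximum grammaticality score of 3; otherwise disfluent (0)."""
    return int(all(score == 3 for score in annotator_scores))

tok = BertTokenizer.from_pretrained("bert-base-cased")
fluency_clf = BertForSequenceClassification.from_pretrained(
    "path/to/fluency-bert")  # fine-tuned on the binarized MSR data

def fluency_score(text):
    """Probability that the text is fluent, from the classifier's softmax.
    Index 1 corresponds to the 'fluent' class by our label convention."""
    inputs = tok(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = fluency_clf(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()
```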

Meaning Preservation
To quantify to what extent a generated candidate maintains the meaning of the original claim, we compute their semantic similarity in each case in terms of the cosine similarity score of their contextual SBERT sentence embeddings (Reimers and Gurevych, 2019).
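In code, this amounts to the following (the SBERT checkpoint name is exemplary):

```python
from sentence_transformers import SentenceTransformer, util

sbert = SentenceTransformer("bert-base-nli-mean-tokens")  # exemplary checkpoint

def meaning_score(original, candidate):
    """Cosine similarity of SBERT sentence embeddings, used as the
    meaning preservation score."""
    embeddings = sbert.encode([original, candidate], convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item()
```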
Argument Quality Finally, to examine whether the generated candidates are better than the original claim from an argumentation perspective, we fine-tune a BERT model on the task of pairwise argument quality classification using the ClaimRev dataset. Since this corpus is also used to fine-tune the sequence-to-sequence model, we apply the same training and validation split as described in Section 3.2 to avoid data leakage, obtaining an accuracy of 75.5. We then use the model's probability scores to determine the relative quality improvement. Further training details can be found in the appendix.
Given the three quality metrics, we calculate the final evaluation score, AutoScore, as the weighted linear sum of the three individual scores:

AutoScore = α · fluency + β · meaning + γ · argument,

where fluency, meaning, and argument are the normalized scores for the three outlined quality metrics, and the three non-negative weights satisfy α + β + γ = 1.
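Taken together, reranking reduces to a weighted argmax over the candidates. The sketch below assumes the per-candidate metric scores have already been computed (e.g., with the scorers above) and reads "normalized" as min-max normalization across the candidate set, which is one plausible instantiation:

```python
def normalize(scores):
    """Min-max normalize a list of scores to [0, 1]."""
    lo, hi = min(scores), max(scores)
    return [0.0 if hi == lo else (s - lo) / (hi - lo) for s in scores]

def select_best(candidates, fluency, meaning, argument,
                alpha=0.43, beta=0.01, gamma=0.56):
    """Return the candidate maximizing
    AutoScore = alpha * fluency + beta * meaning + gamma * argument,
    where the non-negative weights sum to 1."""
    f, m, a = normalize(fluency), normalize(meaning), normalize(argument)
    auto_scores = [alpha * fi + beta * mi + gamma * ai
                   for fi, mi, ai in zip(f, m, a)]
    return candidates[auto_scores.index(max(auto_scores))]
```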

Experiments
This section describes our experimental setup to study how well the claims in the dataset from Section 3 can be improved using our combined generation and reranking approach from Section 4. We particularly focus on the impact of reranking.

Seq2Seq-based Candidate Generation
For candidate generation, we employ the pretrained conditional language model BART, which combines bidirectional and auto-regressive transformers (Lewis et al., 2020). We use the bart-large checkpoint. However, other sequence-to-sequence architectures can also be considered within the suggested framework (see appendix for details).

Quality-based Candidate Reranking
We evaluate our reranking approach, AutoScore, in comparison to three ablations and four baselines.

Approach To utilize AutoScore for ranking candidates, the optimal weighting of its metrics must be determined. Following Yoshimura et al. (2020), we perform a grid search in increments of 0.01 in the range of 0.01 to 0.98 for each weight, maximizing the Pearson correlation coefficient between AutoScore and the original order of the revisions from the claim revision histories in the validation set (see the sketch below). A similar procedure has been used for counterargument retrieval by Wachsmuth et al. (2018). The best weights we found and used were α = 0.43, β = 0.01, and γ = 0.56, suggesting that meaning preservation is of low importance and could potentially be omitted. We suppose this is due to the general similarity of the generated candidates, which makes a strong meaning deviation unlikely.
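A sketch of this grid search, assuming validation triples of metric scores and gold targets derived from the revision histories as inputs:

```python
from itertools import product
from scipy.stats import pearsonr

def tune_weights(val_scores, val_targets, step=0.01):
    """Grid search over (alpha, beta, gamma), each in [0.01, 0.98] with
    alpha + beta + gamma = 1, maximizing the Pearson correlation between
    AutoScore and the gold revision order on the validation set.

    val_scores: list of (fluency, meaning, argument) triples.
    val_targets: gold ranks derived from the revision histories.
    """
    grid = [round(step * i, 2) for i in range(1, 99)]
    best, best_r = None, -1.0
    for alpha, beta in product(grid, grid):
        gamma = round(1.0 - alpha - beta, 2)
        if not 0.01 <= gamma <= 0.98:
            continue  # weights must stay in range and sum to 1
        preds = [alpha * f + beta * m + gamma * a for f, m, a in val_scores]
        r, _ = pearsonr(preds, val_targets)
        if r > best_r:
            best, best_r = (alpha, beta, gamma), r
    return best, best_r
```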
Ablations To assess the impact of each quality metric used in AutoScore, we perform an ablation study where the optimal candidate is chosen based on the individual metric scores:
• Max Fluency. Highest grammatical fluency.
• Max Meaning. Highest meaning preservation.
• Max Argument. Highest argument quality.
Baselines We test four other reranking strategies for 10 candidates generated via top-k sampling:
• Unedited. Return the original input as output.
• Top-1. Return the most likely candidate (obtained by appending the most probable token generated by the model at each time step).
• Random. Return a candidate pseudo-randomly.
• SVMRank. Rerank candidates with SVMRank (Joachims, 2006). We use sentence embeddings to decide which of two claim versions is better, obtained by fine-tuning SBERT (bert-base-cased) in a Siamese setup on the corpus of Skitalinskaya et al. (2021).

Evaluation
We explore claim optimization on all 600 test cases, both automatically and manually:

Automatic Evaluation We compare all reranking strategies against the reference revisions using the precision-oriented BLEU (Papineni et al., 2002), the recall-oriented ROUGE-L (Lin, 2004), and SARI (Xu et al., 2016), which computes the average F1-scores of the added, kept, and deleted n-grams, as well as the exact match accuracy. We also compute the semantic similarity of the optimized claim and the context information to capture whether conditioning claims on the context affects their topic relevance.
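For reference, these metrics can be computed, e.g., with the Hugging Face evaluate library (a sketch; the exact tooling is not prescribed by our setup):

```python
import evaluate

sari = evaluate.load("sari")
bleu = evaluate.load("sacrebleu")
rouge = evaluate.load("rouge")

def score_outputs(sources, predictions, references):
    """Reference-based metrics of the automatic evaluation. Each instance
    has one gold revision, wrapped in a list where multiple references
    are expected."""
    refs = [[r] for r in references]
    return {
        "sari": sari.compute(sources=sources, predictions=predictions,
                             references=refs)["sari"],
        "bleu": bleu.compute(predictions=predictions,
                             references=refs)["score"],
        "rougeL": rouge.compute(predictions=predictions,
                                references=references)["rougeL"],
        "exact_match": sum(p == r for p, r in zip(predictions, references))
                       / len(predictions),
    }
```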
Manual Evaluation As we fine-tune existing generation models rather than proposing new ones, we focus on the reranking step in two manual annotation studies. For each instance, we acquired five independent crowdworkers via MTurk at $13/hour. In the first study, the annotators scored all candidates with respect to the three considered quality metrics. We used the following Likert scales: • Fluency. 1 (major errors, disfluent), 2 (minor errors), and 3 (fluent) • Meaning Preservation. 1 (entirely different), 2 (substantial differences), 3 (moderate differences), 4 (minor differences), and 5 (identical) • Argument Quality. 1 (notably worse than original), 2 (slightly worse), 3 (same as original), 4 (slightly improved), and 5 (notably improved) A challenge of crowdsourcing is to ensure good results (Sabou et al., 2014). To account for this, we obtained the final scores using MACE (Hovy et al., 2013), a Bayesian model that gives more weight to reliable workers. In the given case, 39% of the 46 annotators had a MACE competence value > 0.3, which can be seen as reasonable in MTurk studies.
In the second study, we asked the annotators to rank the four candidates, returned by the reranking strategies, by perceived overall quality. If multiple candidates were identical, we showed each only once. While Krippendorff's α agreement was only 0.20, such values are common in subjective tasks (Wachsmuth et al., 2017;Alshomary et al., 2021).

Results and Discussion
Apart from evaluating the applicability of large generative language models to the task of argumentative claim optimization in general, our experiments focus on two questions: (1) Does the use of explicit knowledge about text and argument quality in the decoding step lead to the selection of better candidates? (2) Does the use of contextual information make the generated candidates more accurate and relevant to the debate?

Overall Claim Optimization Performance
Automatic Evaluation Table 1 shows the automatic scores of all considered reranking strategies. The high scores of the baseline Unedited on metrics such as BLEU and ROUGE-L indicate that many claim revisions change only little. In contrast, Unedited is worst on SARI, as this measure takes into account the goodness of words that are added, deleted, and kept in changes, making it more suitable for evaluating the task at hand. Here, BART with AutoScore reranking performs best on SARI (43.7) and exact match accuracy (8.3%).
The BART+Max Meaning ablation supports the intuition that the candidates with the highest meaning preservation scores are those with minimal changes, if any (72% of the candidates remain identical to the input). Such identical outputs are undesirable, as the claims are not optimized successfully, which is also corroborated by the low weight (β = 0.01) found for the meaning preservation metric when optimizing AutoScore (see Section 5).

Manual Evaluation Table 2 shows that human annotators prefer the optimized candidates selected by AutoScore, with an average rank of 1.92. The difference to Top-1 and Random is statistically significant (p < .05 in both cases) according to a Wilcoxon signed-rank test, whereas the significance of the gain over the second-best strategy, SVMRank, is limited. Also, the candidates of AutoScore and SVMRank are deemed more fluent than those of Top-1 and Random (2.33 vs. 2.29 and 2.26). The argument quality results deviate from the automatic scores, being marginally higher for SVMRank and Top-1. Further analysis revealed that AutoScore and SVMRank agreed on the optimal candidate in 35% of the cases, partially explaining the closeness of the scores.

Overall, we conclude that our approach performed best in the experiments. More importantly, our findings suggest that using reranking approaches that incorporate quality assessments (i.e., AutoScore and SVMRank) leads to candidates of higher fluency and argument quality while preserving the meaning of the original claim. In addition to Figure 1, examples of automatically generated optimized claims can be found in the appendix.

Table 4: Automatic evaluation: Performance of each reranking strategy on the datasets from other domains, in terms of BLEU, ROUGE-L, SARI, ratio of unedited samples, and ratio of exact matches to the target reference.

Performance with Context Integration
General Assessment Table 3 shows the semantic similarity between claims optimized by our approach and the context information, depending on the context given. The results reveal slight improvements when conditioning the model on the previous claim (e.g., 60.3 vs. 59.4 BLEU). To check whether this led to more grounded claims, two authors of the paper compared 600 claims generated with and without the use of the previous claim in terms of (a) which claim seems better in overall quality and (b) which seems more grounded. We found that utilizing the previous claim as context increased quality in 12% of the cases and decreased it in only 1%, while leading to more grounded claims in 36% of the cases.
Qualitative Analysis Our manual inspection of a claim sample revealed the following insights: First, conditioning on context reduces the number of erroneous specifications, particularly for very short claims with up to 10 words. This seems intuitive, as such claims often convey little information about the topic of the debate, making inaccurate changes without additional context likely.
Next, Kialo revisions often adhere to the following form: a claim introduces a statement and/or supporting facts, followed by a conclusion. This pattern was frequently mimicked by our approach. Yet, in some cases it added a follow-up sentence that repeated the original claim in different wording; in others, it generated conclusions containing fallacious or unsound phrases contradicting the original claim. Modeling context mitigated this issue.
Finally, we found that models conditioned on different contexts sometimes generated candidates optimized in different regards, whereas a truly optimal candidate would be a fusion of both suggestions.

Analysis
To explore the nature of claim optimization and the capabilities of our approach, this section reports on follow-up analyses in which we studied (a) what types of claim optimizations exist, (b) how well our approach can operationalize these types, and (c) how well the idea of our approach generalizes to revision domains beyond argumentative texts.

Taxonomy of Optimization Types
To understand the relationship between the optimizations found in the data and the underlying revision intentions, two authors of this paper manually inspected 600 claim revision pairs of the test set. This allows for a detailed analysis of the obtained results, as we are able to identify more fine-grained optimization types in the given task.
For the type distinction, we build on the ideas of Yang et al. (2017), who provide a taxonomy of revision intentions in Wikipedia texts. Claims usually do not come from encyclopedias, but from debates of various shades (an online debate platform in our case) or from monological arguments, as in essays (Persing and Ng, 2015). Therefore, we adapt the terminology of Yang et al. (2017) to gear it more towards argumentative styles. Since we aim for optimization in the end, we consider actions rather than intentions: whereas the former refer to specific changes (e.g., rephrasing a sentence or adding punctuation), the latter describe the goal of a change (e.g., making a text easier to read).
As a result of a joint discussion of various sample pairs, we decided to distinguish eight optimization types, as presented in Table 5. Both authors then annotated all 600 test pairs for these types, which led to only 29 disagreement cases, meaning a high agreement of 0.89 in terms of Cohen's κ. These cases were resolved by both annotators together.

Table 5: The eight optimization types identified in our analysis:
1. Specification. Specifying or explaining a given fact or meaning (of the argument) by adding an example or discussion, without adding new information.
2. Simplification. Removing information or simplifying the sentence structure, e.g., with the intent to reduce the complexity or breadth of the claim.
3. Reframing. Paraphrasing or rephrasing a claim, e.g., with the intent to specify or generalize the claim, or to add clarity.
4. Elaboration. Extending the claim by more information or adding a fact with the intent to make the claim more self-contained, sound, or stronger.
5. Corroboration. Adding, editing, or removing evidence in the form of links that provide supporting information or external resources to the claim.
6. Neutralization. Rewriting a claim using a more encyclopedic or neutral tone, e.g., with the intent to remove bias or biased language.
7. Disambiguation. Reducing ambiguity, e.g., replacing pronouns by concepts mentioned before in the debate, or replacing acronyms with what they stand for.
8. Copy editing. Improving the grammar, spelling, tone, or punctuation of a claim.

Table 5 also shows the cooccurrences of the types and intention labels: Typo/grammar correction and correcting/adding links align well with copy editing and corroboration, respectively. In contrast, clarification is broken down into more fine-grained types, where specification seems most common with 58 cases, followed by simplification and reframing.
Examples of each type are found in the appendix.
We point out that the eight types are not exhaustive for all possible claim quality optimizations, but rather provide insights into the semantic and discourse-related phenomena observed in the data at hand. We further see them as complementary to the argument quality taxonomy of Wachsmuth et al. (2017). In particular, they can be seen as actions to improve the delivery-related quality dimensions: clarity, appropriateness, and arrangement.

Performance across Optimization Types
To enable comparison between the human optimizations and the output of our system, we also labeled 600 claims optimized by BART+AutoScore with the proposed types. Table 6 directly compares automatic and human optimization types. Overall, our approach generates better claims in 60% of the cases, while 84% remain at least of similar quality.
Most notably, we observe that our approach performs optimizations of the type specification 2.5 times as often as humans, and more than twice as many elaboration revisions (55 vs. 23). In contrast, it adds, edits, or removes evidence in the form of links (corroboration) four times less often than humans. The model also made fewer simplifications (18 vs. 43) and no neutralization edits, which may be due to data imbalance regarding such types.
In terms of average quality, specification (65%) and disambiguation edits (63%) most often lead to improvements, but the different types appear rather balanced in this regard. The Jaccard similarity score between optimizations performed by humans and our approach is 0.37, mostly agreeing on copy edits (178 cases) and corroboration (22 cases). Given such low overlap, future work should consider conditioning models to generate specific optimizations.

Performance across Revision Domains
Lastly, we examine whether our approach, along with the chosen text quality metrics, applies to texts from other domains. We consider two datasets: WikiHow (Anthonio and Roth, 2020), containing revisions of instructional texts, and IteraTeR (Du et al., 2022), containing revisions of various formal texts, such as encyclopedia entries, news, and scientific papers. For our experiments, we use the provided document-level splits and sample 1000 revision pairs pseudo-randomly as a final test set. Table 4 shows the automatic evaluation results. In both cases, BART+AutoScore leads to higher SARI scores (48.5 vs. 41.3 for WikiHow, 38.6 vs. 37.0 for IteraTeR) and notably reduces the number of cases where the models failed to revise the input (0.08 vs. 0.50 for WikiHow). The reported BART+Top-1 model represents the approach of Du et al. (2022), indicating that our approach and its text quality metrics achieve state-of-the-art performance with systematic improvements across domains when generating optimized content. However, as different domains of text have different goals, different notions of quality, and, subsequently, different revision types, integrating quality metrics that capture characteristics directly relevant to the domain may further improve the suggested framework. We leave this for future work.

Conclusion
With this paper, we work towards the next level of computational argument quality research: to not only assess but also optimize argumentative text. Applications include suggesting improvements in writing support and automatic phrasing in debating systems. We have presented an approach that generates multiple candidate optimizations of a claim and then identifies the best one using quality-based reranking. In experiments, combining fine-tuned BART with reranking improved 60% of the claims from online debates, outperforming different baseline models and reranking strategies. We showcased generalization capabilities on two out-of-domain datasets, but we also found some claim optimization types to be hard to automate.
In future work, we seek to examine whether the latest language models (e.g., GPT-3) and end-to-end models (where generation and reranking are learned jointly) can further optimize the quality of claims. Moreover, our approach so far relies on the availability of large claim revision corpora and language models. To make claim optimization more widely applicable, techniques for low-resource scenarios and languages should be explored.

Acknowledgments
This work was partially funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under project number 374666841, SFB 1342.

Limitations
This work contributes to the task of argumentative text editing: we explore how to revise claims automatically in order to optimize their quality. While our work may also improve performance on other downstream tasks, it is mainly intended to support humans in scenarios such as the creation and moderation of content on online debate platforms, as well as the improvement of arguments generated or retrieved by other systems. In particular, the presented approach is meant to help users by showing examples of how to further optimize their claims in relation to a certain debate topic, so they can deliver their messages effectively and hone their writing skills.
However, our generation approach still has limitations, as outlined in Section 6, and may favor some revision patterns over others in unpredictable ways. While it may occasionally produce false claims, humans should be able to identify such cases in light of the available context, as long as the improvements remain suggestions and do not happen fully automatically, as intended. Moreover, we expect that further research can ensure that the produced claims are of decent quality by being more attentive to the veracity of claims. Such a focus may allow improving argumentative text consistently and truly supporting humans, rather than hindering them. We also point out that using other pre-trained models to assess fluency, semantic similarity, and argument quality may further improve the results, depending on the target domain. This could be especially important in scenarios where certain quality dimensions are of special interest, such as, for example, convincingness or argument strength. In such cases, the quality metrics considered in the suggested framework and their weights in the overall score should be adjusted to the needs of the users.
The presented technology might also be subject to intentional misuse. A word processing software, for example, might automatically detect and adapt the claims made by the user in a way that favors the political or social views of the software provider. Those changes might then not even be made visible to the user, but only be revealed after exporting or printing the text. In a different scenario, online services, such as social media platforms or review portals, might change posted claims (e.g., social media posts, online reviews) to personalize them and increase user engagement or revenue. These changes might then negatively affect not only the posting user, but also the visiting user. While it is hard to prevent such misuse, we think that the described scenarios are fairly unlikely, as such changes tend to be noticed by the online community quickly. Furthermore, the presented architecture and training procedure would require notable adaptations to produce such high-quality revisions.

A.1 BART-based models
For generation, we use the pre-trained BART model implemented in the fairseq library. The library and pre-trained models are BSD-licensed. We use the BART-large checkpoint (400M parameters) and fine-tune the model for 10 epochs on 2 RTX 2080Ti GPUs. We use the same parameters as suggested for fine-tuning BART on the CNN-DM summarization task in fairseq and set --max-tokens to 1024. The training time is 100-140 minutes, depending on the chosen setup (with or without context information). During inference, we generate candidates using a top-k random sampling scheme (Fan et al., 2018) with the following parameters: the length penalty is set to 1.0, n-grams of size 3 can only be repeated once, the temperature is set to 0.7, and the minimum and maximum lengths of the sequence to be generated are 7 and 256, respectively.
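The corresponding inference call through the fairseq hub interface looks roughly as follows (paths are placeholders; in practice, sampling_topk is varied per candidate as described in Section 4.1):

```python
from fairseq.models.bart import BARTModel

bart = BARTModel.from_pretrained(
    "checkpoints/",                     # placeholder: fine-tuned model dir
    checkpoint_file="checkpoint_best.pt",
    data_name_or_path="claimrev-bin/",  # placeholder: binarized data
)
bart.eval()

candidates = bart.sample(
    ["This technology could be weaponized."],
    sampling=True, sampling_topk=10,  # top-k random sampling
    temperature=0.7,                  # as stated above
    lenpen=1.0,                       # length penalty
    min_len=7, max_len_b=256,         # min/max target length
    no_repeat_ngram_size=3,           # 3-grams repeated at most once
)
```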

A.2 BERT-based models
For the automatic assessment of fluency and argument quality, we use the bert-base-cased pre-trained BERT version, as implemented in the huggingface library. The library and pre-trained models have the Apache License 2.0. We fine-tune the model for two epochs and use the parameters suggested by Skitalinskaya et al. (2021). The accuracy of the trained fluency model on the train/dev/test split suggested by the authors (Toutanova et al., 2016) is 77.4; the argument quality model achieves 75.5.
For labeling missing or unassigned revision types, we use the same bert-base-cased pre-trained BERT model, but in a multi-label setup with the following six classes: claim clarification, typo or grammar correction, correcting or adding links, changing the meaning of the claim, splitting the claim, and merging claims. We fine-tune the model for two epochs using the Adam optimizer with a learning rate of 1e-5 and achieve a weighted F1-score of 0.81.
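A sketch of the multi-label setup in the huggingface library (the class list paraphrases the six labels above):

```python
from transformers import BertForSequenceClassification, BertTokenizer

LABELS = [
    "clarification", "typo/grammar correction", "corrected/added links",
    "meaning change", "claim splitting", "claim merging",
]

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-cased",
    num_labels=len(LABELS),
    problem_type="multi_label_classification",  # sigmoid outputs + BCE loss
)
# Fine-tuned for two epochs with Adam at a learning rate of 1e-5.
```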

B Claim Optimization baselines
For comparison, we provide two additional baseline sequence-to-sequence architectures, which help gauge the model complexity needed for the task at hand:

LSTM. Our first baseline is the popular LSTM variant introduced by Wiseman and Rush (2016). We use the lstm_wiseman_iwslt_de_en architecture, a two-layer encoder and decoder LSTM, each with 256 hidden units, and dropout with a rate of 0.1 between LSTM layers.
Transformer. The second model is based on the work of Vaswani et al. (2017). We use the transformer_iwslt_de_en architecture, a 6-layer encoder and decoder with 512-dimensional embeddings, 1024-dimensional inner layers, and four self-attention heads.
Tables 7 and 8 compare the automatic evaluation scores of all generation-reranking combinations.

C Claim Optimization Examples
For all eight optimization categories, we provide one or more examples illustrating each action in Table 9. Figure 3 shows the annotation guidelines for the Amazon Mechanical Turk study. Table 10 provides examples of candidates selected by different reranking strategies, along with human references illustrating common patterns found in the results. Table 11 provides examples of candidates generated with and without utilizing context knowledge, with insertions and deletions highlighted in green and red fonts, respectively.

Specification
Nipples are the openings of female-only exocrene glands that can have abnormal [secretions] <LINK> during any time of life, get erected by cold stimulation or sexual excitement (much more visibly than in men), get lumps or bumps and change color and size of areola during the menstrual cycle or pregnancy, so their display can break [personal space] <LINK> and privacy (which is stressful), affect public sensibilities and also be a [window] <LINK> for infections, allergies, and irritation.
The idea behind laws, such as limiting the amount of guns, is to reduce the need to defend yourself from a gun or rapist.
It is very common for governments to actively make certain forms of healthcare [harder for minority groups to access] <LINK>. They could also, therefore, make cloning technology hard to access.

Simplification
Very complex, cognitively meaningful behavior such as behaviours like creating art are evidence of free will, because they exhibit the same lack of predictability as stochastic systems, but are intelligible and articulate clearly via recognizable vehicles.

Reframing
It reduces the oversight of the BaFin and thus increases the risk of financial crisis market failures.

Elaboration
It takes 2-4 weeks for HIV to present any symptom. The incubation period risk can't be ruled out for is higher for a member of high risk group, effectively and timely even though member of a low risk group is not completely safe. The decision is based on the overall risk, not on individual level.

Corroboration
[Person-based predictive policing technologies] <LINK> -that focus on predicting who is likely to commit crime rather than where is it likely to occur -violate the [presumption of innocence.] <LINK>.

Neutralization
Biden does not lacks the support or agree with several key issues that are important to liberal voters. of many liberal voting groups due to his stance on key issues concerning them.  Table 9: Illustrative examples of optimization types identified in the paper. The green font denotes additions and the striked out red font denotes the removal of text snippets.

Instructions
In this task, your goal is to identify whether a claim has been successfully improved, without changing the overall meaning of the text. Each task contains a set of pairs, where one claim is the "original claim," and the other an optimized candidate. Each of these pairs have the same original text, but different candidate optimizations.
Please rate each candidate along the following three perspectives: argument quality, fluency, and semantic similarity. Finally, please rank all candidates relative to each other in terms of overall quality.

Argument Quality
Scale (1-5): 1 (notably worse than original), 2 (slightly worse), 3 (same as original), 4 (slightly improved), 5 (notably improved)

Does the optimized claim improve the argument quality compared to the original claim? Relevant changes include, but are not limited to:
• further specifying or explaining an existing fact or meaning
• removing information or simplifying the sentence structure with the intent to reduce the complexity or breadth of the claim
• rephrasing a claim with the intent to specify or generalize the claim, or to add clarity
• adding (substantive) new content or information to the claim or inserting an additional fact with the intent of making it more self-contained, more sound, or stronger
• adding, editing, or removing evidence in the form of links that provide supporting information or external resources to the claim
• removing bias or biased language
• removing uncertainty, e.g., by replacing pronouns referring to concepts that have been mentioned in other claims of the debate, or by replacing acronyms with what they stand for
• improving the grammar, spelling, tone, or punctuation of a claim

Meaning
Scale (1-5): 1 (entirely different), 2 (substantial differences), 3 (moderate differences), 4 (minor differences), 5 (identical)

Does the transformed claim still have the same overall meaning as the original? It is OK if extra information is added, as long as it doesn't change the underlying people, events, and objects described in the sentence. You should also not strongly penalize meaning transformations which aim to generalize or specify some aspects of the claim.
Examples with explanations can be found here.

Figure 3: Guidelines provided to the crowdworkers to annotate automatically generated claim optimizations.

Table 10: Examples of candidates selected by different reranking strategies, along with human references.

Original: There was much progress, advancement and forward looking thinking in the 20th century in Europe across all disciplines.
Top-1: There was much progress, advancement and forward looking thinking in the 20th century in Europe across all disciplines. Therefore, it is logical to conclude that the world has changed in the last 150 years.
SVMRank: There was much progress, advancement and forward looking thinking in the 20th century in Europe across all disciplines, which shows the progress that was made during the century.
AutoScore: There was much progress, advancement, and forward looking thinking in the 20th century in Europe across all disciplines.
Human: There was much progress, advancement and forward looking thinking in the 20th century in Europe across all disciplines.

Table 11: Examples of different candidates generated by BART + AutoScore with and without context information (green font in the original denotes additions of text snippets).

Original: Hunt is the best placed candidate to unify Parliament in order to get a Brexit deal passed.
Claim only: Hunt is the best placed candidate to unify Parliament and the public in order to get a Brexit deal passed.
+ Previous: Hunt could use his position as the Prime Minister to unify Parliament in order to get a Brexit deal passed.
+ Topic: Hunt is the best placed candidate to [unify Parliament] <LINK> in order to get a Brexit deal passed.

Topic: Should high-income countries take in refugees?
Previous: Taking in refugees will increase criminality within host countries.
Original: Differences in criminal activity are a product, in part, of childhood [socioeconomic] <LINK> conditions. This is exacerbated by the longer [path] <LINK> to employment faced by refugees compared to other communities.
Claim only: Differences in criminal activity are a product, in part, of childhood [socioeconomic] <LINK> conditions. This is exacerbated by the longer [path] <LINK> to employment faced by refugees compared to other communities, making them more likely to get involved in crime.
+ Previous: Differences in criminal activity are a product, in part, of childhood [socioeconomic] <LINK> conditions. This is exacerbated by the longer [path] <LINK> to employment faced by refugees compared to other communities. This will not increase criminality.
+ Topic: Differences in criminal activity are a product, in part, of childhood [socioeconomic] <LINK> conditions. This is exacerbated by the longer [path] <LINK> to employment faced by refugees compared to other communities, which make it harder to find a job.

Topic: Mark Twain used the N-word in The Adventures of Huckleberry Finn. Should it be censored?
Previous: Changing the N-word would skip a piece of the linguistic past and thus everyday life. As a result, people could start to forget this part of history.
Original: In Huckleberry Finn, Twain captured the essence of "[everyday midwest American English] <LINK>".
Claim only: In Huckleberry Finn, Twain captured the essence of "[everyday midwest American English] <LINK>". This is a common trait of the American English language.
+ Previous: In Huckleberry Finn, Twain captured the essence of "[everyday midwest American English] <LINK>" by using the N-word in everyday conversation.
+ Topic: In Huckleberry Finn, Twain captured the essence of "[everyday midwest American English] <LINK>", which is a language that is often used by people who do not share his values.