Perceptual Models of Machine-Edited Text

We introduce a novel dataset of human judgments of machine-edited text and initial models of those perceptions. Six machine-editing methods ranging from character swapping to variational autoencoders are applied to collections of English-language social media text and scientific abstracts. The edits are judged in context for detectability and the extent to which they preserve the meaning of the original. Automated measures of semantic similarity and fluency are evaluated individually and combined to produce composite models of human perception. Both meaning preservation and detectability are predicted within 6% of the upper bound of human consensus labeling.


Introduction
Machine-editing systems produce new versions of text using text as input. They contribute to tasks such as automatic summarization, simplification, natural language generation, and generative adversarial NLP systems. These tasks have communicative goals, for example shorter, more accessible, or more appropriate text, and system developers are encouraged to improve correlation with human performance on these tasks. While the measured task performance of machine-editing systems continues to improve, one might consider how humans perceive machine-edited text compared to human-produced text. One-off human evaluation of editing systems is expensive, incomparable across studies, and must be constantly repeated. In this work we make a first attempt at direct, general-purpose modeling of human perception of these texts as it relates to two goals: maximally maintaining the meaning of an original and being minimally perceptible as machine output. We present a dataset of human judgments about detectability and meaning preservation for machine-edited text. This dataset consists of 14,400 judgments about contextualized pairs of machine-edited sentences. The original texts are English-language and come from two domains: scientific papers and social media. The edits are created by six different algorithms using a variety of techniques. By comparing trivial editors to more subtle approaches under the same evaluation framework, we move toward generic models of perception of edited text.
Our analysis finds high interannotator agreement and examines human preference among the six machine editors that generated the candidates. Existing measures of similarity and fluency are evaluated as models of perception. We find that reference-informed models come close to human consensus on meaning preservation and detectability. However, language models that do not have access to the reference text have less success as generic models of detection. This dataset and analysis constitute a step toward modeling meaning preservation and detectability under a variety of machine-editing conditions representative of the state of the practice.

Background
Machine editing is a component of multiple tasks that balance meaning preservation and fluency differently.

Machine-Editing Tasks
Text simplification (Saggion, 2017) and summarization (Narayan et al., 2018) produce new versions of text that are simpler or shorter, intended to be useful to a human reader. Evaluation measures informativeness relative to a reference. Abstractive techniques that fully rewrite the text have recently become viable alternatives to extractive techniques that build new texts from portions of the original text.
Paraphrase generation is the task of producing semantically equivalent variants of a sentence and can underlie applications like question answering and data augmentation. Recent approaches include a component that generates alternatives and a component that estimates their quality as paraphrases (Kumar et al., 2020).
Natural language watermarking of text (Venugopal et al., 2011) and text steganography (Wilson et al., 2014) are conditional text generation practices that require both a meaningful surface form and the hidden encoding of additional information. In this case, it is essential that the text appear plausible, as readers should not suspect the encoded information (Wilson et al., 2015).
Edited texts are used in adversarial learning and attacks for text processing systems. Adversarial inputs change a system output without altering some relevant aspect of human perception of the text, e.g. sentiment when attacking a sentiment analysis system (Alzantot et al., 2018). In cases of adversarial learning, where edited texts are used only to promote system robustness, human perception is not a concern (Jia and Liang, 2017). In contrast, adversarial attack vectors rely on human perception of the attack, whether it be communicating meaning regardless of detectability (Eger et al., 2019) or guaranteeing fluency (Zhang et al., 2019a). While authors have quantified the effect of adversarial perturbations on metrics of text quality like word modification rate and count of grammatical errors (Zeng et al., 2020), the relation of these automatic metrics to human perception has not yet been studied.

Meaning Preservation
Machine editing often aims to guarantee semantic similarity or meaning preservation between input and output. Meaning preservation can be insensitive to surface forms such as tokenization, casefolding, stylistic variation in punctuation, spacing, font choice, and tense. Compact text representations (e.g. Morse code) tend to regularize all potential surface forms.
Semantic textual similarity and paraphrase identification are active areas of investigation in the NLP community (Cer et al., 2017). Natural language inference (NLI) also relies on notions of semantic similarity to recognize a larger set of relations between texts (Bowman et al., 2015). These subfields of NLP investigate semantic relatedness between human-authored texts.
Meaning preservation is related to the concept of informativeness used in automatic summarization and adequacy for machine translation. Summarization metrics tend to lean toward recall to make sure the central concepts of reference summaries are produced and MT metrics tend to lean toward precision to penalize systems that generate something outside of the references.
Many adversarial text editors don't require strict paraphrase, but simply that their perturbations not change the input's classification to a human reader (Ren et al., 2019; Lei et al., 2019; Alzantot et al., 2018; Ebrahimi et al., 2018). Other authors ask judges about similarity to the unperturbed original (Zhao et al., 2018; Alzantot et al., 2018; Ribeiro et al., 2018; Jin et al., 2020). New work correlates automatic metrics with human judgments capturing both semantic similarity and fluency about three word- and character-swapping algorithms (Michel et al., 2019).

Detectability
Language models were introduced early in both automatic speech recognition (Bahl et al., 1983) and statistical machine translation (Brown et al., 1990) to make output text more readable. They aimed to avoid decoding results that appeared computer-generated.
Recent work in several natural language generation tasks augments automatic evaluation, which approximates informativeness, with one-off human evaluations that estimate text quality. Authors elicit judgments for abstractive summaries about readability (Paulus et al., 2018), fluency (Hardy and Vlachos, 2018), and preference between human- and machine-written abstracts (Fan et al., 2018). Desai et al. (2020) elicit human judgments of grammaticality for a compressive summarization system that deletes plausible spans. In image captioning and dialogue systems, several learned metrics judge system output to be higher quality when it is less distinguishable from human text (Cui et al., 2018; Lowe et al., 2017).

Several methods of generating adversarial text have been evaluated through surveys of human perception, for example by asking humans to detect the location of machine edits (Liang et al., 2018), or to judge the likelihood that a sentence is modified by a machine (Ren et al., 2019) or written by a human (Lei et al., 2019). Other authors ask human annotators about grammaticality (Jin et al., 2020), fluency (Zhang et al., 2019b), or readability (Hsieh et al., 2019) as proxies for detectability. Far more work asks whether computers can detect machine-edited text. Research on text generated with large language models finds that the output is easy to detect automatically because of the probabilities of the particular language model itself (Adelani et al., 2020; Gehrmann et al., 2019; Zellers et al., 2019). In fact, the generation setting that best fools humans produces output that is easy to detect automatically (Ippolito et al., 2020). This suggests human perception of such edits is different from machine detection.
Detectability and meaning preservation are not independent variables, but they represent different aspects of human perception. Destroying the fluency of a text can make it detectable as an edit in a high-quality research document, while rewriting a section of chat in standard English can make it detectable in context. One can often transpose digits in scientific measurements to undetectably destroy meaning, and one could rewrite an abstract in randomized case patterns to raise suspicion without altering meaning.

Dataset Construction
We present a dataset of human judgments about two tasks, meaning preservation and detection, in each of two domains, social media and science writing. For each task and domain, we distributed packets of 600 multiple-choice questions to six judges. Each question was an AB test for a pair of editing systems both operating on a sentence in context. The first 105 questions of each packet were the same for all judges and are used to measure interannotator agreement. The remaining 495 sentences were the same, but the pairs of systems compared by judges varied. The judges were all native English speakers who work in AI research and were unfamiliar with the processes used to edit the original text.

[Figure 1: Example prompts as presented to judges.

Meaning preservation prompt: "Which better preserves the meaning of the reference?"
Reference: Later on that day I emailed the company that I purchased my order from and they confirmed it was delivered to that address.
A. Later that day I emailed the company I bought my order, and they confirmed that was delivered to that address.
B. Later in that time i received the website and i sent my email from what it said it was delivered for customer address.
Context (Reddit post): Location: Florida. I didn't know what to flair. About a month ago a package I ordered was delivered to my old apartment complex. When I went to the front office to ask if a package with my name was turned in they said no such thing had occurred. I don't know how to move forward from this.

Detection prompt: "Which sentence reads more like it was altered by a machine?"
A. When thi applied voltage is ifcreased to a few mV we find a strong declease of the spin injection efficiency.
B. While the required voltage is required to a tunnel voltage to obtain a lower amount of the joule injection injection.
Context (ArXiv abstract): Semiconductor spintronics will need to control spin injection phenomena in the non-linear regime. In order to study these effects we have performed spin injection measurements from a dilute magnetic semiconductor [(Zn,Be,Mn)Se] into nonmagnetic (Zn,Be)Se at elevated bias. The observed behavior is modelled by extending the charge-imbalance model for spin injection to include band bending and charge accumulation at the interface of the two compounds. We find that the observed effects can be attributed to repopulation of the minority spin level in the magnetic semiconductor.]
Source sentences for the ArXiv dataset were randomly selected from all sentences in ArXiv abstracts submitted between its start in 1991 and the end of January 2018. The Reddit sentences were randomly selected from all sentences in Reddit posts made in January 2018. The two source collections were roughly the same size. Sentences shorter than 10 tokens or longer than 40 tokens were excluded from both collections to ensure judge productivity. To satisfy IRB requirements and to minimize the likelihood of negative effects on judges, we excluded all posts from the subreddits listed on the official nsfw list, and any that were no longer reachable by September 2019. Table 1 describes statistics about the sentences selected for editing and the contexts provided for judges.
The meaning preservation task involved AB judgments on six different editing systems. For the detection task, we included the original texts among the edited variants for a total of seven systems. We refer to this as the null editor. An all-pairs design of six systems requires 15 pairs and an all-pairs design of seven systems requires 21 pairs. Both designs were iterated to yield 600 pairs of system variants, truncating the final seven-system pattern early. The first 105 example editing pairs (seven full all-pairs sets for meaning and five for detection) were identical for all judges and the remaining 495 in each packet were chosen from the possible pairs according to independent permutations to encourage balance.
In more detail: C(6, 2) = 15 (meaning preservation) and C(7, 2) = 21 (detectability). 600 examples align exactly with a 15-item boundary but not with a 21-item boundary. Thus, there are 12 examples left over from a complete set of all-pairs of 7 systems in detectability when truncating to 600 items per source per judge. The first 105 system-pair assignments come from 7 cycles through C(6, 2) pairs or 5 cycles through C(7, 2). The remaining sequences of pairs for each judge are all-pairs cycles through independently randomized permutations of the systems. Machine edit assignments to positions A and B were independently shuffled for each judge, and the questions were presented to each judge in randomly shuffled order. Judges were instructed to choose between the two alternatives.
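The arithmetic of this design can be checked with a short script (a sketch for illustration; `all_pairs` is a hypothetical helper, not code from the study):

```python
from itertools import combinations

def all_pairs(n_systems):
    """One all-pairs cycle over n systems: the C(n, 2) unordered pairs."""
    return list(combinations(range(n_systems), 2))

# Meaning preservation compares 6 editors; detection adds the null editor.
assert len(all_pairs(6)) == 15
assert len(all_pairs(7)) == 21

# A 600-item packet holds exactly 40 meaning cycles, but 28 detection
# cycles plus 12 leftover pairs from a truncated 29th cycle.
assert 600 % 15 == 0
assert 600 % 21 == 12

# The shared 105-item prefix is 7 meaning cycles or 5 detection cycles.
assert 105 == 7 * 15 == 5 * 21
```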
Each item was presented as a choice between two edited versions of the same sentence, presented with the rest of the Reddit post or ArXiv abstract as context. Figure 1 shows examples and the specific prompts used to elicit judgments. In the meaning preservation example, the candidates were produced by round-trip machine translation and the VAE. In the detection example, the candidates were produced by charswap and the VAE. Cases where detection paired the null system against a machine editor were collected to determine how often each editor was preferred to the original.
Less than half of one percent of detectability items are automatically marked as ties. These include cases where the edited text is the same string as the original, disregarding casing and punctuation, or where the VIPER editor (described below) produced an alternative that rendered identically in packets. These are included in the analysis to capture the intuition that a perceptual model should assign tied items the same score.

Machine-Editing Systems
We employ six editing systems to capture the effect that varied systems have on human perception. Each takes just the sentence to be edited, without context.

Swapping editors
Simple word- and character-swapping editors are prevalent in the literature about adversarial attacks and data augmentation (Michel et al., 2019). Our charswap editor is inspired by several works in adversarial NLP that examine character swapping as a minimal change to text inputs that can degrade system performance (Belinkov and Bisk, 2018; Ebrahimi et al., 2018). Our implementation randomly swaps 1 to 3 lower-case ASCII characters per input for other ASCII characters, selecting the least likely of 100 alternatives under the GPT-2 language model (Radford et al., 2019).
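A minimal sketch of such an editor might look like the following, assuming a pluggable `scorer` callable that stands in for the GPT-2 log-probability used in the actual implementation:

```python
import random
import string

def charswap(sentence, scorer, n_candidates=100, seed=0):
    """Sketch of a charswap editor: propose candidates that replace 1-3
    lower-case ASCII characters, then keep the candidate the scorer
    rates least likely. `scorer` is a stand-in for a language model
    log-probability (lower = less likely under the model)."""
    rng = random.Random(seed)
    positions = [i for i, c in enumerate(sentence)
                 if c in string.ascii_lowercase]
    candidates = []
    for _ in range(n_candidates):
        chars = list(sentence)
        n_swaps = min(rng.randint(1, 3), len(positions))
        for i in rng.sample(positions, n_swaps):
            chars[i] = rng.choice(string.ascii_lowercase)
        candidates.append("".join(chars))
    return min(candidates, key=scorer)  # least likely alternative
```

Each candidate differs from the input in at most three character positions; the final selection is entirely driven by the supplied scoring function.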
VIPER is a character-swapping algorithm informed by visual closeness, inspired by a common strategy used to avoid keyword filters, for example in online forums (Eger et al., 2019). The VIPER algorithm replaces random characters with their nearest neighbors among embeddings based on their glyphs, e.g. l→1 and 0→O. We further bias the open source implementation toward visual closeness by randomly swapping between 1 and 3 characters, with the probability of each swap weighted by its visual similarity.
The AddCos system uses word embedding distance to replace a single word with a paraphrase. The algorithm is adapted from a machine translation metric that measures the fit of words that are not in a reference, using the cosine similarity of the proposed replacement and the sum of vectors for sentence context (Apidianaki et al., 2018).
We adapt the open source implementation as a machine editor, obtaining candidate replacements from the Penn Paraphrase Database (Ganitkevitch et al., 2013) and selecting the one best replacement.
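The selection rule can be illustrated in miniature with hypothetical word vectors standing in for real embeddings and PPDB candidates:

```python
import math

def addcos_score(candidate_vec, context_vecs):
    """Cosine similarity between a candidate word's vector and the sum of
    the context word vectors, as in the AddCos fit measure."""
    ctx = [sum(col) for col in zip(*context_vecs)]
    dot = sum(a * b for a, b in zip(candidate_vec, ctx))
    norms = (math.sqrt(sum(a * a for a in candidate_vec))
             * math.sqrt(sum(b * b for b in ctx)))
    return dot / norms

def pick_replacement(candidates, context_vecs):
    """Choose the paraphrase candidate whose vector best fits the context.
    `candidates` is a list of (word, vector) pairs."""
    return max(candidates, key=lambda c: addcos_score(c[1], context_vecs))[0]
```

The vectors below are toy two-dimensional examples; a real system would use trained word embeddings and candidates drawn from the paraphrase database.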

Rewriting editors
Machine translation (MT) has recently become reliable and, in some cases, on par with human translation (Bojar et al., 2018). We utilized round-trip MT (from source English text to another language and then back) as a type of text editor. Three of the authors performed a blind assessment of approximately one hundred candidate languages available from an online MT provider and determined that en → pt → en is a high-quality round-trip route.
A variational autoencoder (VAE) learns a semantically meaningful latent space. We use an implementation based on the model of Zhang et al. (2017), available at https://github.com/mitre/tmnt, to train a VAE for each domain with 200,000 sequences of up to 40 tokens. Edits are obtained by encoding an original sentence and sampling from the latent distribution.
Syntactically controlled paraphrase networks (SCPNs) encode sentences and decode them according to a target constituency parse (Iyyer et al., 2018). Unlike swapping editors, this system introduces syntactic variation. Using the open source code and default templates, we generate ten paraphrases per sentence. We select the paraphrase with the best GPT-2 language model score.

Modeling Human Perception
We evaluate a set of automatic metrics as models of human perception. To test a metric as a model of the collected judgments, the metric scores each edited sentence and chooses the item in the pair with the better score. The choice is compared to the judge's preference.
In addition, we learn a combination system that scores sentences by weighting each component metric. One combination is learned for each task, using the data from both domains. The data is split into a training set of 80% used for fitting the combination, a validation set of 10% and a final test set of 10% of examples. For items repeated among judges, all six instances are assigned to the same partition.
Our objective function, maximizing agreement on AB tests, is neither continuous, smooth, nor particularly amenable to a logistic transform. We search for our mixture parameters using Dlib's MaxLIPO+TR algorithm, which combines a Lipschitz-function upper bound with trust-region search (King, 2009). The model is optimized to minimize errors on the training set with an L1 regularization term. A best combination is selected using forward feature selection and validation set accuracy.
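The AB-agreement objective itself is simple to state in code. A minimal sketch (function names are ours, not the paper's):

```python
def combined_score(weights, metric_values):
    """Weighted sum of component metric scores for one edited sentence."""
    return sum(w * v for w, v in zip(weights, metric_values))

def ab_agreement(weights, items):
    """Fraction of AB items where the weighted combination picks the same
    side as the judge. Each item is (metrics_A, metrics_B, judge_chose_A)."""
    hits = 0
    for m_a, m_b, chose_a in items:
        model_chose_a = combined_score(weights, m_a) > combined_score(weights, m_b)
        hits += (model_chose_a == chose_a)
    return hits / len(items)
```

This objective is a step function of the weights, which is why a derivative-free global search is used rather than gradient methods.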

Experiments
We examine text similarity and fluency metrics that originate from several tasks in NLP as possible models of human perception. We first present the portfolio of metrics we use.

Measures of Meaning Preservation
Levenshtein edit distance measures the minimum number of character operations needed to change one string into another (Levenshtein, 1966). We compute both the classical Levenshtein distance over character edits and its word-level variant (WER).
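The classical dynamic-programming formulation works over any sequence type, so one implementation serves both character edits and word edits (WER):

```python
def levenshtein(a, b):
    """Edit distance via dynamic programming over sequence items; pass
    strings for character edits or token lists for word-level edits."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]
```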
NLP Task Metrics. We evaluated several metrics used to measure the quality of NLP system output compared to a human reference for tasks including machine translation, summarization, and image captioning. BLEU is a machine translation evaluation method based on word n-gram precision, with a brevity penalty (Papineni et al., 2002). The METEOR metric uses stemming and WordNet synsets to characterize acceptable synonymy in translation (Banerjee and Lavie, 2005). CIDEr also uses stemming and incorporates importance weighting for n-grams based on corpus frequency (Vedantam et al., 2015). The ROUGE-L metric, used in summarization and image captioning, is based on the longest common subsequence between a reference and hypothesis (Lin, 2004). ChrF and variants like chrF++ compare bags of character n-gram substrings to capture sub-word similarity without language-specific resources (Popović, 2016, 2017). The BEER metric is trained to correlate with human judgment at a sentence level using features like character n-grams and permutation trees that are less sparse at that level (Stanojević and Sima'an, 2014).
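As an illustration of the character n-gram family, a simplified chrF-style score for a single n-gram order might be sketched as follows (real chrF averages several orders, and chrF++ adds word n-grams):

```python
from collections import Counter

def char_ngram_f(hyp, ref, n=3, beta=2.0):
    """Simplified chrF-style score: F-beta over bags of character
    n-grams of one order, with whitespace removed."""
    def ngrams(s):
        s = s.replace(" ", "")
        return Counter(s[i:i + n] for i in range(len(s) - n + 1))
    h, r = ngrams(hyp), ngrams(ref)
    overlap = sum((h & r).values())   # clipped n-gram matches
    if not overlap:
        return 0.0
    precision = overlap / sum(h.values())
    recall = overlap / sum(r.values())
    return ((1 + beta**2) * precision * recall
            / (beta**2 * precision + recall))
```

The recall-heavy beta of 2 mirrors chrF's default weighting toward the reference.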
Neural Network-based Similarities. Recent work uses trained, neural-network vector representations to quantify semantic similarity. We experiment with three based on BERT, a neural network trained on English Wikipedia and the BooksCorpus (Devlin et al., 2019). BERTScore computes an F1-based similarity score between the contextual embeddings for subword tokens in a candidate and reference sentence (Zhang et al., 2020). The metric can also be computed as RoBERTaScore using weights from RoBERTa pretraining (Liu et al., 2019). BLEURT fine-tunes BERT to predict sentence-level machine translation quality scores (Sellam et al., 2020). Sentence-BERT measures similarity using a model fine-tuned with a paraphrase objective to create semantically meaningful sentence vectors that can be directly compared (Reimers and Gurevych, 2019).
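The core matching step behind BERTScore-style metrics can be sketched with toy, unit-normalized token vectors (this omits subword tokenization, contextual encoding, and importance weighting):

```python
def greedy_match_f1(cand_vecs, ref_vecs):
    """BERTScore-style F1: match each candidate token to its most similar
    reference token (precision) and each reference token to its best
    candidate match (recall). Vectors are assumed unit-normalized, so the
    dot product equals cosine similarity."""
    def cos(u, v):
        return sum(a * b for a, b in zip(u, v))
    precision = sum(max(cos(c, r) for r in ref_vecs)
                    for c in cand_vecs) / len(cand_vecs)
    recall = sum(max(cos(c, r) for c in cand_vecs)
                 for r in ref_vecs) / len(ref_vecs)
    return 2 * precision * recall / (precision + recall)
```

The recall component (per-reference-token matching) is the part our results find most predictive of human judgment.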

Measures of Detectability
Detectability is a property of text in context, without regard for a reference. We evaluate language model scores, which measure fluency, as proxies for detectability. We evaluate a Kneser-Ney 5-gram language model trained on a full English Wikipedia dump (Wikipedia contributors, 2020) with KenLM (Heafield, 2011). We estimate the model using modified Kneser-Ney smoothing without pruning. We also evaluate the language model score given by GPT-2, a large neural transformer-based language model trained on 8 million web pages (Radford et al., 2019). We use the technique described in Salazar et al. (2020) to obtain a BERT Masked Language Model (MLM) that accounts for the model's self-attention. We compute each language model score under two conditions: using only the edited sentence, and including one sentence before and after the edited sentence (+context).
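The masked-LM scoring technique reduces to summing per-token log-probabilities with one token masked at a time. A sketch with a stub standing in for the BERT forward pass:

```python
import math

def pseudo_log_likelihood(tokens, masked_prob):
    """Masked-LM scoring in the style of Salazar et al. (2020): sum the
    log-probability of each token when it alone is masked out.
    `masked_prob(tokens, i)` is a hypothetical stub for the probability a
    real masked language model assigns to tokens[i] given the rest."""
    return sum(math.log(masked_prob(tokens, i))
               for i in range(len(tokens)))
```

Higher (less negative) scores indicate more fluent text; the +context condition simply extends `tokens` with the neighboring sentences.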
Predictions from BERT's Next Sentence Prediction (NSP) task estimate the likelihood of a sequence of sentences. This classifier is trained to discriminate sequences of two sentences found in the pretraining corpus from sequences drawn using negative sampling (Devlin et al., 2019).

Table 2 illustrates the relative success of the machine-editing systems. Success is measured as the number of A/B tests where an edit by the system was selected (for meaning) or the other item was selected (for detectability), divided by the number of prompts involving the editor. Preference refers to only the portion of the detectability dataset where the edited text is compared to the original. The swapping algorithms are most often chosen as preserving meaning. The visual perturbations of VIPER have little effect on perception of meaning. The preference for these editors is not as strong on detectability items. For both tasks, the round-trip machine translation model is preferred in slightly over half of comparisons, while the VAE and SCPN perform quite poorly. One substantial difference among these conditional generation algorithms may be that MT is trained on web-scale data, while the others are trained in-house with relatively small datasets.

Results
Among detectability items, human judges prefer an edited version over the original (null system) 4.9% of the time, 101 of 2054 relevant judgments. These prompts most commonly include the round trip machine translation editor, but all editing systems were preferred over the original at least once. Round-trip machine translation is picked over the original reference 20% of the time in ArXiv and 10% of the time in Reddit, suggesting that these outputs are more fluent or more typical for the domain than the original. For these items, the character swapping algorithms are most detectable.

Annotator Consistency
One hundred five prompts per task were presented to all judges to measure interannotator agreement. Because judgments are pairwise AB tests with randomized assignment of edits to positions, we compare against a chance prior of 50% agreement. For ArXiv, the probability of agreement among pairs of judges was 82.2% for meaning and 75.6% for detection. For Reddit, the probabilities were 86.7% and 75.9% respectively. The lower interannotator agreement in the science and technology domain may reflect lower familiarity with the subjects of the abstracts.
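The pairwise agreement statistic used here can be computed as follows (a sketch; the per-judge answer encoding is hypothetical):

```python
from itertools import combinations

def pairwise_agreement(judgments):
    """Observed probability that two judges agree: over all judge pairs
    and all shared prompts, the fraction of matching answers.
    `judgments` is a list of per-judge answer lists over the same prompts."""
    matches = total = 0
    for x, y in combinations(judgments, 2):
        matches += sum(a == b for a, b in zip(x, y))
        total += len(x)
    return matches / total
```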
A consensus vote is determined by a plurality of the six judges, or randomly in cases of ties. The probability of agreement of a random judge with the consensus is reported in Table 3 as an upper bound for the performance of automatic systems.

Table 3 shows the correspondence of the best metrics with the 630 multiply-annotated prompts and includes the upper bound of human consensus. The table shows only metrics with accuracy within five items of the best. Table 4 shows agreement with the entire dataset and over the full set of systems tested. The meaning metrics are also evaluated as measures of detectability. At editing time, they can be used to estimate the detectability of a candidate edit. However, they are not practical as generic detection models sniffing out machine edits in the wild where no original is available.

Correspondence of Metrics with Human Judgment
Several automatic metrics show good correspondence with meaning. The best systems include large, neural models intended to capture subtle synonymy as well as simple metrics like chrF. In general, the recall component of BERTScore-based metrics correlates better than the precision component. Though the BLEURT metric is trained to predict human judgments of translation quality, it seems a poor fit for perceptions of meaning preservation in our dataset. Applied to the detection task, the reference-informed metrics also approach the upper bound of human consensus. Using RoBERTaScore as a single model of both meaning preservation and detectability reaches over 81% agreement with consensus.
The language model metrics fall behind in performance on detection but still perform well above the level of chance. We find that including additional context improves performance for the same system, and the large, neural models greatly outperform the traditional 5-gram model. Across the board, models with RoBERTa training perform better than their BERT-based counterparts.

Table 5 shows performance for predicting the detection items for which the judge preferred the edited text to the original. A baseline system that always selects the original gets around 95% accuracy, but cannot identify an edited text that a human accepts as a substitute. All of the detection systems tested were able to identify some substitutable edits. The best overall are large language models with context, reaching 0.241 F1.
As shown in Table 6, learned combinations of metrics are able to achieve better performance than the single best metric for each task. The components of those systems are specified in Table 7, sorted by their importance in the combination as calculated by the product of the standard deviation of their values (σ) and the magnitude of their weights (w).

Conclusion
We introduced a novel dataset of human judgments of machine-edited texts and initial models of those perceptions. A portfolio of automated metrics was assessed for the ability to predict judges' preferences on meaning preservation and detectability. Automated measures of semantic similarity and fluency were evaluated individually and combined to produce factored models of human perception. Both meaning preservation and detectability are modeled within 6% accuracy of the upper bound of human consensus labeling. However, we observe that existing metrics poorly predict whether humans find an edited text to appear more human-like than the original.
Future work could explore deeper models and other factors of human perception not modeled by the metrics present here. For example, humans are sensitive to capitalization and correct spacing but many automatic metrics perform tokenization and normalization. Direct modeling of human perception drives understanding of human factors involving text variation. Adaptive models of human text perception would enable text editing to target understanding by individual readers.