Human Perception in Natural Language Generation

We ask subjects whether they perceive a set of texts as human-produced; some of these texts are actually human-written, while others are automatically generated. We use this data to fine-tune a GPT-2 model to push it to generate more human-like texts, and observe that the output of this fine-tuned model is indeed perceived as more human-like than that of the original model. In addition, we show that our automatic evaluation strategy correlates well with human judgements. We also run a linguistic analysis to unveil the characteristics of human- vs machine-perceived language.


Introduction
Pre-trained language models, such as the BERT (Devlin et al., 2019) and the GPT (Radford et al., 2018, 2019) families, are nowadays the core component of NLP systems. These models, based on the Transformer (Vaswani et al., 2017) and trained using huge amounts of crawl data (which can contain substantial noise), have been shown to produce high quality text, more often than not judged as human-written (Radford et al., 2019; De Mattei et al., 2020; Brown et al., 2020). Existing evaluations of GPT-2 models (Ippolito et al., 2020; De Mattei et al., 2020) have shown that while generated sentences were ranked lower in human perception than gold sentences, many gold sentences were also not perceived as human-like.
To make the model produce more human-like texts, one could train it only on gold data that is highly perceived as human, but such data is costly, and full model retraining is often not computationally viable. As an alternative route, we explore whether and how an existing pre-trained model can instead be fine-tuned to produce more humanly-perceived texts, and how to evaluate this potentially shifted behaviour. We see the advantage of this experiment in at least two ways. One is that the generation of more human-like texts is highly beneficial for specific applications, for example human-machine interaction in dialogues; the other is that it opens the opportunity to investigate which linguistic aspects make a text more humanly-perceived.
We run our experiments on Italian, using GePpeTto (De Mattei et al., 2020) as the pre-trained model. First, we collect human judgements on gold texts and on texts generated by GePpeTto in terms of how they are perceived (human or automatically produced). We then fine-tune GePpeTto with this perception-labelled data. In addition, inspired by the classifier-based reward used in style transfer tasks (Lample et al., 2019; Gong et al., 2019; Luo et al., 2019; Sancheti et al., 2020), we reward the model to push its classification confidence. We evaluate the new perception-enhanced models in comparison with the original GePpeTto by running both an automatic and a human evaluation on output generated by the various models. Lastly, we conduct a linguistic analysis to highlight which linguistic characteristics are more commonly found in human- and machine-perceived text.
Author contribution note: Lorenzo De Mattei and Huiyuan Lai contributed equally.
Contributions First, we show that a pre-trained GPT-2 model can be fine-tuned to produce text that is perceived as more human, and we release this model for Italian. Second, we provide a stronger automatic evaluation method in which training is done on perception labels rather than on the actual source; its results correlate with human judgements, providing a different angle for the automatic evaluation of generated sentences. Lastly, we run a linguistic analysis of the humanly-perceived texts that can open up new opportunities for understanding and modelling human-like perception.

Data
We collected human judgments over a series of gold and generated sentences in terms of how much a given text is perceived as human-like. The obtained labelled data is used to fine-tune our base model towards generating more humanly-perceived texts; it is also used to test the resulting models through an automatic evaluation strategy that we implement next to human judgements.
Training Data From GePpeTto's original training corpus (De Mattei et al., 2020), we collected 1400 random gold sentences in the following way. We sentence-split all the documents and picked the first sentence of each document. In order to allow for length variation, which has an impact on perception, we selected the first 200 sentences for each of the lengths 10, 15, 20, 25, 30, 35 and 40 tokens. We also let GePpeTto generate texts starting with the first word of randomly selected documents, sentence-split the generated texts, and again selected the first 200 sentences for each of the lengths 10, 15, 20, 25, 30, 35 and 40 tokens. This procedure creates a training set with perception labels containing a total of 2800 instances (1400 gold and 1400 generated).
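The length-controlled selection described above can be sketched as follows. This is a minimal illustration, not the authors' script: the function name `select_by_length`, whitespace tokenisation, and the reading of "length 10, 15, ..." as exact token counts are all assumptions.

```python
from collections import defaultdict

LENGTH_BINS = (10, 15, 20, 25, 30, 35, 40)  # token lengths named in the text
PER_BIN = 200                               # sentences kept per length

def select_by_length(sentences, bins=LENGTH_BINS, per_bin=PER_BIN):
    """Keep the first `per_bin` sentences for each target length.

    Assumption: a sentence's length is its whitespace token count and
    must match a target length exactly.
    """
    selected = defaultdict(list)
    for sentence in sentences:
        n_tokens = len(sentence.split())
        if n_tokens in bins and len(selected[n_tokens]) < per_bin:
            selected[n_tokens].append(sentence)
    return dict(selected)

# Toy usage with a smaller quota per length.
toy = [" ".join(["tok"] * n) for n in (10, 10, 10, 12, 15, 40)]
picked = select_by_length(toy, per_bin=2)
```

With 7 lengths and 200 sentences per length, one pass over gold sentences and one over generated sentences yields the 1400 + 1400 instances reported above.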
We asked native Italian speakers to rate, on a 1-5 Likert scale, whether they felt the text they were seeing had been written by a human (1) or a machine (5). Each text was assessed by 7 different judges. The subjects for the task were laypeople recruited via the crowdsourcing platform Prolific (https://www.prolific.co/). We did not control for, and thus did not elicit, any demographic features. As a proxy for attention and quality control, we used completion time, and filtered out participants who took too little time to perform the task (we set a threshold of at least 5 minutes for 70 assessments as a reliable minimum effort). Crowdworkers were compensated at a rate of £5.04 per estimated hour. In practice, tasks were completed in a shorter time than estimated, so the hourly rate was a bit higher.
Mapping the average of human judgements to a binary classification (human if < 3), we obtain the matrix in Table 1, which shows perception labels against the actual source labels. While human texts are more often perceived as human-like than machine-generated ones, the matrix shows that 44.2% of the texts are perceived as artificial, suggesting that a good portion of the training data might lead to generation that is not particularly human-like.
We train two classifiers on 80% of this data, one on the task of detecting human-like perception and one on that of detecting the actual source. The classifiers are built by adding a dropout layer (Srivastava et al., 2014) and a dense layer on top of UmBERTo (https://huggingface.co/Musixmatch/umberto-commoncrawl-cased-v1), which is a RoBERTa-based (Liu et al., 2019) language model trained on large Italian corpora. We train them using Adam (Kingma and Ba, 2015), with an initial learning rate of 1e-5 and a batch size of 16. On the remaining 20% of the data we obtain F=0.97 for the source identification task and F=0.92 for the perception task, showing the feasibility of the classification and thus the possibility of using these classifiers for evaluation (Section 4).

Models
We use three models for text generation, all based on the GPT-2 architecture (Radford et al., 2019). The basic model is GePpeTto, a GPT-2-based model for Italian released by De Mattei et al. (2020). The other two are built on GePpeTto using the perception-labelled data, one through fine-tuning and one in a reinforcement learning setting.

GePpeTto
GePpeTto is built using the GPT-2 base architecture, with 12 layers and 117M parameters. It is trained on two main sources: a dump of Italian Wikipedia, consisting of 2.8GB of text, and the ItWac corpus (Baroni et al., 2009), which amounts to 11GB of web texts. De Mattei et al. (2020) show that GePpeTto is able to produce text which is much closer to human quality than to the text generated by other baseline models. Still, real human-produced text is recognised as such more often than GePpeTto's output.

GePpeTto fine-tuned
Using the original settings of GePpeTto, the model is fine-tuned on the training portion of the humanly-perceived sentences of the perception-labelled data (Table 1), using the Huggingface implementation (Wolf et al., 2020). We use the Adam optimiser (Kingma and Ba, 2015) with an initial learning rate of 2e-5 and a mini-batch size of 8. During fine-tuning, we apply early stopping with patience 5 if the performance on validation does not improve. The resulting model (GePpeTto-F) should produce text recognised more frequently as human-produced than the original GePpeTto.
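The early-stopping rule with patience 5 can be illustrated with a short sketch. It assumes that "performance on validation" is a validation loss to be minimised and that it is checked once per epoch; the text specifies neither, and the function name `early_stopping_epoch` is hypothetical.

```python
def early_stopping_epoch(validation_losses, patience=5):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses.

    Training stops once `patience` consecutive checks pass without a new
    best loss. Assumption: lower is better, one check per epoch.
    """
    best_loss = float("inf")
    best_epoch = -1
    checks_without_improvement = 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            checks_without_improvement = 0
        else:
            checks_without_improvement += 1
            if checks_without_improvement >= patience:
                return best_epoch, epoch
    # Patience never exhausted: training ran to the last epoch.
    return best_epoch, len(validation_losses) - 1

# Toy usage: the loss bottoms out at epoch 2 and then drifts upwards.
losses = [3.0, 2.5, 2.4, 2.45, 2.5, 2.6, 2.7, 2.8, 2.9]
best, stopped = early_stopping_epoch(losses, patience=5)
```

In this toy run the best checkpoint is the one from epoch 2, and training halts at epoch 7, after five checks without improvement.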

GePpeTto rewarded
To further encourage GePpeTto-F to generate more humanly-perceived texts, we introduce a confidence reward based on the 'perception classifier' (PC) described in Section 2: the model gets rewarded for generating more human-like text. The reward is derived from the PC's confidence, where θ denotes the PC's parameters, which are kept fixed while fine-tuning GePpeTto.
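As an illustration only, a classifier-confidence reward of this general kind could be plugged into a REINFORCE-style loss as sketched below. Both the reward form (probability of "human" minus probability of "machine") and the loss are assumptions made for this example; they are not the formulation used for GePpeTto-R, and the function names are hypothetical.

```python
import math

def confidence_reward(p_human):
    """Hypothetical reward in [-1, 1] from a frozen classifier.

    `p_human` is the classifier's probability that a text is perceived as
    human. Assumption: reward = p(human) - p(machine).
    """
    return p_human - (1.0 - p_human)

def policy_gradient_loss(token_log_probs, reward):
    """REINFORCE-style loss: the negative reward-weighted log-likelihood of
    the sampled text. Minimising it raises the likelihood of texts with a
    positive reward and lowers it for texts with a negative reward."""
    return -reward * sum(token_log_probs)

# Toy usage: a two-token sample that the classifier finds fairly human-like.
log_probs = [math.log(0.5), math.log(0.25)]
reward = confidence_reward(0.9)
loss = policy_gradient_loss(log_probs, reward)
```

Because the classifier's parameters stay fixed, only the generator receives gradient updates in such a setup.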

Evaluation
The regressor used for the automatic evaluation is trained with an initial learning rate of 1e-5 and a batch size of 16. We calculate the correlation of the regressor's scores with human judgements over each single data point in the test set (N=1400), and observe good scores (Pearson=0.54, p < 10^-4; RMSE=0.75).
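For reference, the two agreement measures reported here can be computed as in the sketch below. It uses only the standard definitions of Pearson's r and RMSE; the toy scores are invented for illustration and are not data from the study.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def rmse(xs, ys):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs))

# Toy usage: predictions that track the human scores with a constant offset.
human_scores = [1.0, 2.0, 3.0, 4.0]
regressor_scores = [1.5, 2.5, 3.5, 4.5]
r = pearson(human_scores, regressor_scores)
error = rmse(human_scores, regressor_scores)
```

The toy case shows why both measures are useful together: a constant offset leaves the correlation perfect while still producing a non-zero RMSE.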
For the human evaluation, we assign to each sentence the average score computed over all human judgements. We then average all resulting scores over the seven length bins. Results are shown in two tables, as follows.
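The two-step averaging can be sketched as follows, assuming each record pairs a sentence's length bin with its list of judgements. The record layout and the name `average_by_length` are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def average_by_length(records):
    """Average judgements per sentence, then per length bin, then overall.

    `records` is an iterable of (length_bin, judgements) pairs, one per
    sentence. Returns (per-bin means, mean over the bins).
    """
    sentence_scores = defaultdict(list)
    for length_bin, judgements in records:
        sentence_scores[length_bin].append(mean(judgements))
    per_bin = {b: mean(scores) for b, scores in sentence_scores.items()}
    overall = mean(list(per_bin.values()))
    return per_bin, overall

# Toy usage with two length bins.
records = [(10, [1, 2, 3]), (10, [3, 4, 5]), (40, [4, 4, 4])]
per_bin, overall = average_by_length(records)
```

Averaging over bins rather than over all sentences gives every length equal weight, which matters only when the bins differ in size.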
First, as we did for the training data (see Table 1), we mapped the average of human judgements to a binary classification (human if < 3), and obtained the matrix in Table 2. This shows perception labels and the actual source labels for the three models and the gold data. We see that the human-produced texts are the most humanly-perceived, but both the fine-tuned and the rewarded model produced texts that are more humanly-perceived than those of GePpeTto, with the fine-tuned model performing better than the rewarded one.
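The mapping from averaged judgements to binary perception labels, and the resulting source-by-perception matrix, can be sketched as below. The input layout, the function names and the toy judgements are assumptions made for illustration.

```python
from collections import Counter
from statistics import mean

def perception_label(judgements, threshold=3.0):
    """Map 1-5 Likert judgements (1 = human, 5 = machine) to a binary label.

    A text counts as human-perceived when its average is strictly below
    the threshold, matching the "human if < 3" rule.
    """
    return "human" if mean(judgements) < threshold else "machine"

def cross_tabulate(items):
    """Count (source, perceived label) pairs.

    `items` is a list of (source, judgements) pairs, one per text.
    """
    return Counter((source, perception_label(j)) for source, j in items)

# Toy usage: four texts, each rated by seven judges.
items = [
    ("gold", [1, 2, 2, 1, 3, 2, 2]),
    ("gold", [4, 4, 5, 3, 4, 4, 5]),
    ("GePpeTto", [2, 2, 3, 2, 1, 2, 2]),
    ("GePpeTto", [3, 3, 3, 3, 3, 3, 3]),
]
matrix = cross_tabulate(items)
```

Note that an average of exactly 3 falls on the machine side under the strict inequality.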
Second, Table 3 shows the average score over all length bins for the three models, GePpeTto, GePpeTto fine-tuned (GePpeTto-F) and GePpeTto rewarded (GePpeTto-R), and for the original human texts (Human). This table also reports the average scores over all lengths as assigned by the regressor. The closer to 1, the more humanly-perceived the text. As a first observation, in both the human and the automatic evaluations the final rank for the systems is the same, showing the reliability of the automatic evaluation. The gold texts are perceived as most human-like by humans (score: 2.41) and by the regressor (score: 2.47). Regarding systems, the fine-tuned model (GePpeTto-F) performs better than both the basic and the rewarded model.
To compare the overall performance of machine vs humans, in Fig. 1 we plot the average performance of the three models per length as judged by humans (blue) and by the regressor (red). These two lines are compared with those for gold texts, again assessed by humans (yellow) and by the regressor (green).
Comparing the models and the humans as assessed by humans (blue and yellow lines), we see that while for short sentences humans perceive the generated and the natural texts as equally human-like, this changes substantially for longer fragments. At length 40, we observe the largest gap in perception between the models and the natural texts, with the latter being perceived as much more human-like.
In terms of machine-based evaluation (red and green lines), the behaviour of the BERT regressor on human data is very similar to the human judgements (green vs yellow line). Although the two curves are also similar for the texts generated by the models, the regressor here tends to judge as human-produced texts that are actually machine-generated (red vs blue line). This is potentially due to the fact that GePpeTto-F and GePpeTto-R are fine-tuned on the same (human-labelled) training data that is used to train the regressor model. This phenomenon appears exacerbated with longer texts, as the blue and red lines are further apart after length 20. This behaviour of the regressor is also reflected by its scores being more compressed towards the middle. Indeed, the average standard deviations in Table 3 show higher variability in human judgements than in the regressor's assessment. Table 4 reports some examples of generated sentences together with their scores.
Figure 1: Average perception scores for human vs machine generated texts as assessed by humans and our regressor. In legend: <producer-assessor>. Machine scores are averaged across the three models.

Linguistic Analysis
We ran a linguistic analysis over the human and the generated text using Profiling-UD (Brunato et al., 2020), a tool that extracts linguistic features of varying complexity, ranging from raw text aspects, such as average length of words and sentences, to lexical, morpho-syntactic, and syntactic properties. In particular, we study (i) which features characterise the most humanly-perceived texts in the training data, independently of who generated them; and (ii) the difference between human-produced texts and those generated by our best model (GePpeTto-F) in the test set when they are perceived as human.
Regarding (i), the features that most correlate with a text being perceived as human have to do with sentence length and complexity. For example, the longer the sentence or the clauses therein, or the longer and deeper the syntactic links, the more humanly-perceived the text is. On the other side of the spectrum, the linguistic features associated with texts judged as machine-generated are a heavy presence of punctuation, interjections and symbols.
For (ii), we zoom in on humanly-perceived texts only, but look at the source that generated them. For human texts, length and complexity are still the relevant features for being perceived as human; these are proxied by complex verbal structures characterised by auxiliaries, use of the past tense, and the number of main predicates in a sentence. For the generated texts, instead, we observe that both the characteristics shared with human texts, such as the use of the indicative mood and finite tenses, and those more specific to machine-generated texts, such as a low density of subordinate clauses and shorter sentences, point to simpler structures, where the machine is less likely to make evident mistakes: it is easier for the model to produce human-looking sentences if they are kept short and simple. With longer sentences the model struggles to ensure semantic and pragmatic coherence, two aspects that most likely require further and more complex modelling beyond simple fine-tuning.
Table 4 (excerpt): examples of generated sentences with their two reported scores.
GePpeTto-R: La casa si trova in una posizione favorevole all'espansione del mercato e, in alcuni casi, alla costruzione di tende per bambini. (The house is in a favorable position for the expansion of the market and, in some cases, for the construction of children's tents.) Scores: 3.14, 2.68
GePpeTto: La squadra era composta di due squadre, una delle quali era la "Rhodesliga" con il termine del "Propaganda Fiumana". (The team was made up of two teams, one of which was the "Rhodesliga" with the term of "Propaganda Fiumana".) Scores: 3.15, 3.07

Conclusions
We elicited judgements on the human-likeness of gold and generated Italian texts and used these judgements to fine-tune a pre-trained GPT-2 model to push it to produce more human-like texts. Our evaluation shows that people indeed find the output of the fine-tuned model more human-like than that of the basic one. In addition, we show that our proposed automatic evaluation correlates well with human judgements, and that it is therefore a reliable strategy that can be applied in the absence of human subjects.
An analysis of linguistic features reveals that while complexity is associated with human-likeness in gold data, simplicity is a key feature of artificial texts that are assessed as human-like, perhaps because simpler texts are less prone to expose machine behaviour.
Future work will include an expansion of the perception-labelled data to (i) assess the effect of training size in fine-tuning, and (ii) perform a finer-grained analysis correlating assessments to different text genres and subject demographics.

Impact Statement
All work that automatically generates text could unfortunately be used maliciously. While we cannot fully prevent such uses once our models are made public, we do hope that writing about risks explicitly and raising awareness of this possibility in the general public are ways to contain the effects of potentially harmful uses. We are open to any discussion and suggestions to minimise such risks. The contributors of human judgements elicited for this work have been fairly compensated.

A Appendix
This Appendix contains:
• detailed results of human and machine evaluation for gold and all models' data (Tables A.1-A.2), expanding the compressed results shown in Table 2 in the main paper;
• details of linguistic features correlated with human and machine perception (Tables A.3-A.4), which are discussed in Section 5 in the main paper.