The Perils of Using Mechanical Turk to Evaluate Open-Ended Text Generation

Recent text generation research has increasingly focused on open-ended domains such as story and poetry generation. Because models built for such tasks are difficult to evaluate automatically, most researchers in the space justify their modeling choices by collecting crowdsourced human judgments of text quality (e.g., Likert scores of coherence or grammaticality) from Amazon Mechanical Turk (AMT). In this paper, we first conduct a survey of 45 open-ended text generation papers and find that the vast majority of them fail to report crucial details about their AMT tasks, hindering reproducibility. We then run a series of story evaluation experiments with both AMT workers and English teachers and discover that even with strict qualification filters, AMT workers (unlike teachers) fail to distinguish between model-generated text and human-generated references. We show that AMT worker judgments improve when they are shown model-generated output alongside human-generated references, which enables the workers to better calibrate their ratings. Finally, interviews with the English teachers provide deeper insights into the challenges of the evaluation process, particularly when rating model-generated text.


Introduction
Recent advances in neural language modeling have spurred research into open-ended text generation tasks such as story generation (Peng et al., 2018a), style transfer (Krishna et al., 2020), and pun generation (He et al., 2019). Since the space of possible outputs for these tasks is huge compared to more constrained problems such as machine translation, automatic metrics such as BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004) that measure similarity to reference texts are mostly uninformative (Akoury et al., 2020). Human evaluation of model-generated text, which is critical for open-ended tasks given the unreliability of automatic metrics (Peng et al., 2017; Reiter, 2018; See et al., 2019), is frequently conducted on Amazon's popular Mechanical Turk platform (AMT) to minimize cost and time. Most existing AMT studies ask crowdworkers to provide Likert scale ratings of various properties of generated text, such as fluency and likability.
In this paper, we study the reliability and reproducibility of AMT evaluations of open-ended text generation. We first conduct a survey of papers on open-ended text generation published between 2018 and 2020 and find that many critical details often go unreported (e.g., worker qualifications, payment, task descriptions, annotator agreement), a finding in line with prior reproducibility studies outside open-ended text generation (Card et al., 2020; Howcroft et al., 2020; van der Lee et al., 2021).
Next, we perform a series of story generation evaluations with both AMT workers and expert raters (English teachers), applying a variant of the most common task configuration that appeared in our survey (5-point Likert scale ratings of 200 examples with three annotators per example) to the paragraph-length WritingPrompts dataset of Fan et al. (2018). Unlike prior work in this area, we ask raters to evaluate both stories generated by a fine-tuned GPT-2 language model (Radford et al., 2019) and human-written reference stories on the same scale, as we expect the latter to consistently score higher on all evaluations. Our experiments expose and quantify several troubling trends:
1. AMT ratings do not reliably distinguish model-generated text from human-generated text unless workers are asked to rate both side-by-side, which allows them to better calibrate their ratings.
2. Running an identical task (same AMT parameters and input data) on different days of the week exhibits high variance and can lead to dubious conclusions (e.g., that reference texts are lower quality than GPT-2 generated text).
3. Many AMT workers do not carefully read the text that they are evaluating. Even after enabling multiple qualifications to exclude low-quality workers, 42% of workers on average take fewer than 40 seconds to complete each task. Filtering out these workers can significantly change the overall ratings, but also notably reduces the number of datapoints.
4. Even expert raters struggle to read and judge model-generated text. The time they spend per example increases significantly compared to that for references, and agreement also drops.
For future human evaluations of open-ended text generation tasks, we urge researchers to obtain expert raters whenever possible. If AMT is the only feasible option, we recommend that available reference outputs also be evaluated alongside model-generated ones to improve rating calibration, and that heavy filtering of the worker population (possibly through qualification tasks or post-hoc removal) be performed prior to reporting results.

A survey of papers that evaluate open-ended text generation with AMT
We begin with a survey of 45 papers that use AMT to evaluate the output of open-ended English-language text generation models, which includes generated stories, metaphors, paraphrases, puns, sarcasm, and sentences with transferred style or attributes. Each paper was published between 2018 and 2020 at ACL, NAACL, or EMNLP, and we exclude papers that use AMT to evaluate more well-established generation tasks like machine translation or summarization. Unlike previous surveys of evaluating generated text (Çelikyilmaz et al., 2020; van der Lee et al., 2021), we focus specifically on AMT evaluations of open-ended text generation. In this section, we provide an overview of the different types of evaluation task setups present in our survey; later, we experiment with several variants of the most common setup.
Evaluation criteria: As in the survey of Howcroft et al. (2020), we observe a variety of different evaluation criteria and definitions of these criteria across the 45 papers. The most common criteria include fluency and/or grammaticality (19), overall "quality" (12), and relevance (10) to a corresponding prompt. Furthermore, stories in particular tend to also be evaluated on some notion of coherence (9) and likability (4).
Rating scales: More than half of the papers (24) employ a 5-point Likert scale to evaluate the above criteria; of these, 19 provided labels for just the end points of the scale (e.g., "lowest" vs. "highest"), while 5 labeled all points on the scale. The next most common evaluation type is ranking two or more system outputs (23). Less common are other Likert scales (3, 4, or 6-point), pass/fail tasks, and output-prompt matching tasks.
Number of raters and rated items: An alarming number of papers (14) do not even report the number of raters and/or items (6) used for evaluation. Of the remaining, most papers (16) obtain ratings from 3 separate AMT workers per item. The most common number of items per evaluation is 100 (14). The number of raters per item in other studies ranges from 2 to 11, while the number of items ranges from 12 to 1,000.

Worker qualifications and compensation: The vast majority of papers do not report AMT worker qualifications (32) or worker compensation (35), which adds to the reproducibility woes. Among papers that report qualifications, the most common were a HIT approval rate of ≥ 90%-99% and a number of approved HITs between 500 and 5,000. Only 11 papers mentioned restricting workers to those from English-speaking countries or applying some kind of language test, despite all evaluations being done on English text.
Length of the rated text: As open-ended text generation encompasses an array of different tasks, the length of the rated text differs greatly, ranging from single sentences (28), sometimes presented in a longer context, to short paragraphs (7) and longer paragraphs (14). The latter setting is most commonly used for story generation tasks.

Evaluating story generation with AMT
Our survey reveals that the most popular Mechanical Turk task design for open-ended text generation asks AMT workers to rate various properties of generated text on a 5-point Likert scale. In this section, we conduct a series of AMT evaluations for the open-ended problem of story generation by varying different parameters within this standard task design. Importantly, we evaluate both model-generated stories as well as human-generated reference stories, which provides a pseudo upper bound for the ratings. Our experiments reveal that worker qualifications (e.g., HIT approval rate and number of accepted HITs) do not notably impact judgments or spam rate on reference stories, with the exception of country of origin. Furthermore, we uncover an issue with rating calibration: when both reference and model-generated stories are included for the same prompt, average reference scores are significantly higher than those for model-generated text; however, when workers only see one type of text per HIT, they give similar average scores to both types.

Experimental Setup
We first describe the parameters of our experiments before later analyzing the results.
Dataset: We use the WritingPrompts dataset collected by Fan et al. (2018), which is a collection of 303,358 English language stories written by Reddit users on the r/WritingPrompts subreddit. This dataset, which consists of short prompts paired with user-written stories (e.g., "There are 10 legendary dentists who review every toothpaste. You are the 10th... being hunted by the other 9..."), has been used in multiple previous works on paragraph-length story generation (Fan et al., 2019; See et al., 2019; Mao et al., 2019). We randomly select 200 prompts from the test set for all of our experiments. Since the human-written stories in the dataset are already tokenized, we first de-tokenized the stories, cleaned up artifacts from lemmatization, and manually truncated each story so that it ends with a full sentence and is no longer than 150 words in order to make the length comparable with the machine-generated story. We use the resulting stories for all experiments with reference text.
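The truncation above was done manually. As a rough illustration of the same constraint (at most 150 words, ending on a full sentence), the following sketch automates it with a simple punctuation heuristic. The function name `truncate_to_full_sentence` and the regular expression are hypothetical choices for this example, not part of the original pipeline, and the heuristic will mis-handle abbreviations such as "Dr.".

```python
import re

# Sentence-final punctuation, optionally followed by closing quotes/brackets,
# and then whitespace or end of string. This is a heuristic, not a parser.
SENTENCE_END = re.compile(r"""[.!?]["')\]]*(?=\s|$)""")


def truncate_to_full_sentence(text, max_words=150):
    """Clip `text` to at most `max_words` words, then cut back to the
    last sentence boundary inside the clipped span."""
    words = text.split()
    clipped = " ".join(words[:max_words])
    boundaries = list(SENTENCE_END.finditer(clipped))
    if not boundaries:
        # No sentence boundary found: return the word-clipped text as is.
        return clipped
    return clipped[: boundaries[-1].end()]


story = "One two three. Four five six seven. Eight nine"
short = truncate_to_full_sentence(story, max_words=6)
full = truncate_to_full_sentence(story, max_words=150)
```

Here `short` keeps only the first sentence, because the second one does not finish within six words, and `full` drops the trailing unfinished fragment.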
Model-generated stories: We follow a similar modeling approach to prior story generation work (Mao et al., 2019; Guan et al., 2020) by fine-tuning a pretrained GPT-2 medium-sized model (Radford et al., 2019) on the training set of the WritingPrompts dataset, using the Hugging Face Transformers library (Wolf et al., 2020). We use a batch size of approximately 50k tokens, a learning rate of 5e-5 with a linear learning rate schedule, and train for 3 epochs, stopping training after validation perplexity converges to ∼19. Each training example consists of a concatenation of a prompt, separator token (new line character), and reference story. At test-time, we feed the same 200 prompts selected above to our model for fair comparison to the human-written stories, and we generate three stories per prompt using nucleus sampling (Holtzman et al., 2019) with p = 0.9. We manually truncate each sample so that it ends with a full sentence and is no longer than 150 words. These stories are used in all experiments evaluating machine-generated stories.
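To make the decoding step concrete, here is a minimal, library-free sketch of nucleus (top-p) sampling over a single next-token distribution. It is an illustration of the technique only: the function names `nucleus_filter` and `sample_nucleus` and the toy distribution are assumptions for this example, and the experiments themselves used the Hugging Face Transformers library rather than this code.

```python
import random


def nucleus_filter(probs, p=0.9):
    """Keep the smallest set of most probable tokens whose cumulative
    probability reaches p, and renormalize over that set.
    Returns a dict {token_index: renormalized_probability}."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, total = [], 0.0
    for index, prob in ranked:
        kept.append((index, prob))
        total += prob
        if total >= p:
            break
    return {index: prob / total for index, prob in kept}


def sample_nucleus(probs, p=0.9, rng=random):
    """Draw one token index from the truncated, renormalized distribution."""
    dist = nucleus_filter(probs, p)
    indices = list(dist)
    weights = [dist[i] for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]


# Toy next-token distribution over a 4-token vocabulary (illustrative only).
toy_probs = [0.5, 0.3, 0.15, 0.05]
nucleus = nucleus_filter(toy_probs, p=0.9)
token = sample_nucleus(toy_probs, p=0.9, rng=random.Random(0))
```

With p = 0.9, the three most probable tokens form the nucleus and the least probable token can never be sampled.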
AMT task parameters: We conduct all experiments using the default interface in Mechanical Turk (see Figure A1 and Figure A2). Workers were asked to rate human-written and/or machine-generated stories on four attributes, with the following definitions provided to them:
1. Grammar: "How grammatically correct is the text of the story fragment?"
2. Coherence: "How well do the sentences in the story fragment fit together?"
3. Likability: "How enjoyable do you find the story fragment?"
4. Relevance: "How relevant is the story fragment to the prompt?"
Their ratings fall on a 5-point Likert scale with the corresponding endpoints labelled as "lowest" (1 point) and "highest" (5 points). Since our survey did not find many previous papers that reported using detailed descriptions for each point on the scale, we chose to use minimal labels to mimic the most popular setup (see Section 2 for details).
Each of our AMT experiments shows workers the same 200 prompts paired with human and/or machine-generated stories, and we solicit three worker judgments per HIT. Workers were paid $0.20 per HIT for tasks that showed one story, and $0.35 per HIT for those that showed two stories; in total, our AMT experiments cost roughly $1.5K. Importantly, each experiment used a completely different set of workers (i.e., each worker could only participate in one experiment, although they can complete multiple HITs within that experiment), which is an intentional choice to prevent workers from judging the same story multiple times. Finally, to eliminate potential variations stemming from evaluation on different days (weekdays vs. weekends) and time of day, we launch all experiments on weekdays between 11:00-11:30AM PST.

AMT Evaluations of Reference Text
Our first set of experiments concerns only human-written reference stories; we move to machine-generated text in the next subsection. One of our assumptions with human-written stories, supported by the expert teacher assessment in Section 4, is that they should receive relatively high scores for all four properties (except perhaps likability, which is highly subjective). We thus use reference texts to evaluate various AMT parameters such as qualifications or day of task launch, observing how modifications to these parameters affect the average scores of reference text.
Impact of worker qualifications: We run four experiments evaluating the previously-described set of 200 prompts with reference story fragments, varying the worker qualifications as follows: (1) no qualifications, (2) including only workers with HIT approval rate > 90%, (3) including only workers with approval rate > 90% and at least 1000 approved HITs, (4) including only workers with approval rate > 90% and at least 1000 approved HITs who are located in English-speaking countries. The results in the top portion of Table 1 suggest that applying all of the qualifications (i.e., workers from English-speaking countries, approval rate > 90%, approved HITs ≥ 1000) has a positive effect on the quality of workers, as this setting yielded the highest scores out of the four experiments for coherence and relevance, while ratings for grammar were also high. Ratings for likability were lower than in the experiments with less strict qualifications, but likability is a very subjective measure which consistently shows very low agreement (Krippendorff's α of -0.04 to 0.11). When all AMT worker qualifications are enabled, the worker ratings more closely align with those made by English teachers, although there are still substantial deviations (Section 4). Additionally, with all qualifications enabled, workers show higher agreement for grammar, coherence, relevance and even likability, although the agreement between raters remains low.
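For readers unfamiliar with the agreement statistic reported above, the following sketch computes Krippendorff's α as 1 minus the ratio of observed to expected disagreement. It assumes the interval distance metric (squared difference between ratings); the text does not state which metric was used for the reported values, so this choice and the function name `krippendorff_alpha_interval` are assumptions for illustration.

```python
from itertools import combinations


def krippendorff_alpha_interval(units):
    """Krippendorff's alpha with the interval metric (squared difference).

    `units` is a list of items; each item is the list of ratings it received
    (e.g. three Likert scores). Items with fewer than two ratings are ignored.
    """
    units = [u for u in units if len(u) >= 2]
    n = sum(len(u) for u in units)  # total number of pairable ratings
    if n < 2:
        raise ValueError("need at least two pairable ratings")

    # Observed disagreement: squared differences within each item, over
    # ordered pairs (hence the factor 2), weighted by 1 / (m_u - 1).
    d_o = 0.0
    for u in units:
        within = sum((a - b) ** 2 for a, b in combinations(u, 2))
        d_o += 2.0 * within / (len(u) - 1)
    d_o /= n

    # Expected disagreement: squared differences over all ordered pairs of
    # ratings, regardless of which item they belong to.
    values = [v for u in units for v in u]
    across = sum((a - b) ** 2 for a, b in combinations(values, 2))
    d_e = 2.0 * across / (n * (n - 1))

    if d_e == 0:
        return 1.0  # every rating is identical
    return 1.0 - d_o / d_e


perfect = krippendorff_alpha_interval([[1, 1, 1], [5, 5, 5], [3, 3, 3]])
systematic = krippendorff_alpha_interval([[1, 2], [1, 2]])
```

Values near 1 indicate strong agreement, values near 0 indicate agreement no better than chance, and negative values (as in the second toy call, where the raters always differ) indicate systematic disagreement.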
High variance across different days: Concerned by the low overall agreement, we decided to run another set of experiments that repeats the same experiment (all qualifications enabled) across three different days. Due to our constraint that each worker can only participate in one experiment, each of these experiments has a different subset of qualified workers. As shown in the second portion of Table 1, although the first and third days yielded similar mean ratings/agreement in terms of grammar (M=4.00, IAA=0.21 vs. M=3.98, IAA=0.18) and coherence (M=4.11, IAA=0.14 vs. M=4.05, IAA=0.13), the second day received lower ratings across the board and had poor IAA overall (see Table 1). Furthermore, ratings for relevance on the third day (M=3.46) were significantly lower than on the first two days (M=3.71), which indicates that simply using all AMT qualifications is not enough to achieve consistent results.
Many AMT workers do not spend enough time reading the stories: The low overall agreement also motivated us to examine the average time each worker spent per HIT. While AMT reports WorkTimeInSeconds in the results file made available to task requesters, we observe, similar to Akoury et al. (2020), that these times are artificially inflated due to workers who accept multiple HITs at the same time and work on them sequentially (e.g., in different tabs). Such workers are also frequently among the most prolific in terms of HITs completed per experiment (see Figure 2), since there is no maximum number of HITs per worker (footnote 9). We correct for this by measuring the time between consecutively submitted HITs by the same worker, which can be derived by analyzing start and end times of each HIT. This "actual time" differs considerably from the AMT reported WorkTimeInSeconds: for instance, a worker that AMT reports had a mean work-time of 360 seconds had an actual mean working time (footnote 10) of 22s and a median of 13s. To put these numbers in perspective, this is about one-fourth of the time that the fastest English teacher achieved (see Section 4).
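The correction described above can be sketched as follows. This is one plausible reading rather than the exact procedure used: for each worker, HITs are sorted by submission time, and the time for a HIT is measured from the later of its own accept time and the worker's previous submission. The record keys `worker`, `accept`, and `submit` are hypothetical names for this example.

```python
from collections import defaultdict
from datetime import datetime


def actual_times(assignments):
    """Estimate per-HIT working time for each worker.

    `assignments` is a list of dicts with keys 'worker', 'accept', and
    'submit' (the last two are datetime objects). Returns
    {worker: [seconds, ...]} ordered by submission time.
    """
    by_worker = defaultdict(list)
    for record in assignments:
        by_worker[record["worker"]].append(record)

    result = {}
    for worker, records in by_worker.items():
        records.sort(key=lambda r: r["submit"])
        times, previous_submit = [], None
        for r in records:
            # Work on this HIT cannot have started before it was accepted,
            # and is assumed not to start before the previous submission.
            start = r["accept"]
            if previous_submit is not None:
                start = max(start, previous_submit)
            times.append((r["submit"] - start).total_seconds())
            previous_submit = r["submit"]
        result[worker] = times
    return result


# A worker accepts three HITs at once and submits them in quick succession.
accepted = datetime(2021, 1, 4, 11, 0, 0)
batch = [
    {"worker": "W1", "accept": accepted, "submit": datetime(2021, 1, 4, 11, 6, 0)},
    {"worker": "W1", "accept": accepted, "submit": datetime(2021, 1, 4, 11, 6, 13)},
    {"worker": "W1", "accept": accepted, "submit": datetime(2021, 1, 4, 11, 6, 30)},
]
per_hit = actual_times(batch)
```

In this toy batch the reported work times would be 360, 373, and 390 seconds, whereas the estimated actual times for the second and third HITs are 13 and 17 seconds.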
As it is impossible to carefully read a paragraph-length story and assess all four properties in as little as 13 seconds, we measure the impact on average ratings when filtering out workers who spend too little time per HIT (last row of Table 1). Specifically, we remove judgments from workers whose median time is below 40s (which is a low bar), and find that on average about 42% of our ratings are filtered out (ranging from 20%-72% across all experiments) (footnote 11). Of our surveyed papers, only Akoury et al. (2020) report actual work time, demonstrating that this is a major issue in modern AMT evaluations of text quality that most researchers have overlooked.
Footnote 9: Like most AMT tasks (Fort et al., 2011; Robinson et al., 2019), the majority of HITs for our evaluations are provided by a small fraction of workers. The majority of workers provided ratings for only one or two stories while a very few productive workers rated over 50% of the stories (see Figure 2).
Footnote 10: The mean work time is also not very representative as workers typically accept multiple HITs, wait a period of time, then submit all accepted HITs in quick succession.
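The filtering rule above (drop all judgments from workers whose median time per HIT is below 40 seconds) can be sketched in a few lines. The record keys `worker`, `seconds`, and `score` and the function name are hypothetical, and `seconds` is assumed to hold the corrected actual time rather than WorkTimeInSeconds.

```python
from collections import defaultdict
from statistics import median


def filter_by_median_time(judgments, min_median_seconds=40.0):
    """Remove every judgment made by a worker whose median time per HIT
    is below `min_median_seconds`.

    Returns (kept_judgments, fraction_of_judgments_removed).
    """
    times = defaultdict(list)
    for j in judgments:
        times[j["worker"]].append(j["seconds"])

    careful = {w for w, t in times.items() if median(t) >= min_median_seconds}
    kept = [j for j in judgments if j["worker"] in careful]
    removed = 1 - len(kept) / len(judgments) if judgments else 0.0
    return kept, removed


# Toy data: worker A rushes (median 15s), worker B does not (median 52.5s).
toy = [
    {"worker": "A", "seconds": 360, "score": 5},
    {"worker": "A", "seconds": 13, "score": 5},
    {"worker": "A", "seconds": 15, "score": 4},
    {"worker": "B", "seconds": 60, "score": 3},
    {"worker": "B", "seconds": 45, "score": 4},
]
kept, removed_fraction = filter_by_median_time(toy)
```

Using the median rather than the mean keeps a single long first HIT (such as the 360-second one here) from masking a pattern of very fast submissions.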
Impact of worker country of origin: While all of the surveyed papers evaluate only English text, only 11 of them reported using some kind of filtering to ensure that workers have sufficient knowledge of English. The default AMT setting does not filter workers by country of origin, which potentially increases the variance of results depending on the English proficiency of workers who accept HITs. To measure this, we re-run our experiment with all qualifications, except we restrict the task to only workers from countries that do not primarily speak English (i.e., we exclude workers from the US, Canada, UK, Australia, New Zealand, Ireland, and Singapore). The third portion of Table 1 shows that workers from non-English speaking countries rated coherence, relevance, and grammar (footnote 12) significantly lower than identically-qualified workers from English-speaking countries (Days 1-3). Thus, researchers rating English text should restrict their tasks to English-speaking countries, although Kennedy et al. (2020) find that many workers use Virtual Private Networks (VPNs) to take part in tasks restricted to those in the US.

Evaluating Machine-Generated Text
We now turn to AMT evaluation of machine-generated stories produced by the GPT-2 model described in subsection 3.1. Based on our previous experiments with reference texts, we select the "all qualifications" setting (i.e., workers from English-speaking countries, approval rate > 90%, approved HITs ≥ 1000) for all GPT-2 AMT tasks. We study two different conditions: (1) HITs contain a prompt and a GPT-2 generated text, and (2) HITs contain a prompt and both a human-written reference story as well as a GPT-2 generated story. In the latter case, we ask AMT workers to rate both texts on each of the four properties. Overall, we observe that workers cannot effectively distinguish between reference and model-generated stories when they are evaluated separately (in terms of average ratings), but that this distinction emerges clearly when they are presented with both types of stories in the same HIT.
Footnote 11: We also ran experiments with even stricter qualification filters (i.e., acceptance rate ≥ 99% and at least 10,000 approved tasks), but this made no notable difference to the percentage of data being filtered out (35%). This is most likely due to the fact that most requesters are reluctant to reject HITs regardless of quality, which results in an estimated 95% of workers having an approval rate of 98% or above (Matherly, 2019; Wessling et al., 2017).
Footnote 12: There was no significant difference between grammar ratings collected from raters from non-English speaking countries and ratings collected on Day 2.
When presented with only GPT-2 generated text, AMT workers rate it similarly to reference texts, despite its obviously worse quality: In our first experiment, we follow the protocol from our experiments with human-written reference stories, showing AMT workers a prompt and a model-generated story and asking them to rate it on the same attributes (grammar, coherence, relevance, and likability). The results of this evaluation are presented in the upper row of Table 2 along with the three sets of ratings of reference stories obtained with the same "all qualifications" setting from before (Days 1-3 in Table 1).
Surprisingly, GPT-2 output is not consistently rated significantly lower than human-written text. For instance, workers in Day 2 rated human-written stories similarly to the GPT-2 generated stories in terms of grammar (M=3.86 vs. M=3.94) and coherence (M=3.92 vs. M=3.82), while workers in Day 3 rated human-written stories as similarly relevant to the prompt as GPT-2 output (M=3.46 vs. M=3.44). Depending on which reference day we compare the GPT-2 output to, GPT-2 is rated similarly to human-written stories in terms of all four properties, which indicates that this evaluation is uninformative; nevertheless, the majority of surveyed papers use exactly this task design to obtain ratings for model-generated output.
Asking workers to rate both human-written and model-generated stories side-by-side improves ratings: We hypothesize that the previous result is due to scale calibration differences between the two settings: when repeatedly confronted with incoherent model-generated text, a worker may be more generous with their ratings compared to if they only see coherent human-written text. Thus, we explore whether their ratings can be better calibrated by asking them to rate both types of

Evaluation by expert teachers
The experiments in the previous section demonstrate the unreliability of AMT ratings for open-ended text generation, even when qualifications are used to restrict the task to ostensibly reliable workers. In this section, we compare the ratings produced by AMT workers to those of expert raters, specifically a set of three English teachers, and discover significant deviations between the two groups. Though the teachers rated both types of stories separately, their ratings clearly distinguish between human-written references and machine-generated stories. We also conducted post-task interviews with the teachers and organized a mediation session to discuss stories with high disagreement, observing that they reach consensus after discussion in about 80% of cases.
Recruiting English teachers: We choose English teachers as experts for our story generation task because they regularly evaluate student-written papers and are experienced at detecting both low-level grammatical mistakes as well as discourse-level issues with logical coherence. The three teachers were recruited from the authors' personal networks, and each of them either has a degree in teaching English as a Second Language or a CELTA certificate. They were paid $125 each for participating in our experiments, which required them to rate the same 200 human-written stories and 200 GPT-2 generated stories on the same four properties as the AMT workers, given an identical task interface.
Unlike AMT workers, teachers rate reference stories higher than GPT-2 generated ones: We asked teachers to first rate the 200 reference stories, and then a week later to rate the GPT-2 generated stories. Just like the AMT workers, they were not told that the text in the second task was machine-generated. Importantly, we used the same set of teachers for both tasks, so they already had significant experience with the task when rating the machine-generated text (as opposed to using new AMT workers for each experiment). The results of this evaluation are presented in the last row of Table 2. Unsurprisingly, teachers rated human-written stories significantly higher than GPT-2 generated stories in terms of coherence (M=4.38 vs. M=3.73), relevance (M=3.82 vs. M=2.54), and likability (M=3.69 vs. M=2.96) (all p's<0.001). On the other hand, they rated human-written stories and GPT-2 generated stories as similar in terms of grammar (M=4.50 vs. M=4.56). Moreover, teachers' ratings of human-written stories are considerably higher than AMT ratings for all attributes except likability (M=3.69), which, depending on the day, was rated lower (M_Day1=3.37) or higher (M_Day2=3.73) by the AMT workers. Similarly, teachers' ratings of GPT-2 stories are lower than the ratings we obtained from AMT workers for coherence (M=3.73 vs. M=4.11), relevance (M=2.54 vs. M=3.71), and likability (M=2.96 vs. M=3.37).
Teachers need to see many examples to properly calibrate their ratings: In post-task interviews, all teachers reported that it took them 10-20 stories on average to calibrate their ratings. Since most AMT workers complete only one to two HITs, they do not have similar time to get acquainted with the task; this may suggest that having a pre-task training phase can improve worker calibration.
Coherence is difficult to rate for machine-generated text: The teachers unanimously report that while coherence is easy to rate for reference stories (since most of them are largely coherent), it is the most difficult property to rate for GPT-2 generated stories. Since they did not know that they were rating machine-generated text, they spent time trying to make sense of the author's possible intent in producing many of the strange artifacts and hallucinations common to output of neural language models (Holtzman et al., 2019). In contrast, relevance turned out to be the easiest property of machine-generated text for teachers to rate, which is expected as many of GPT-2's stories deviate very quickly from the prompt (see Figure 1).
GPT-2 generated stories are much harder for teachers to rate overall: All teachers reported struggling more when rating GPT-2 stories, a fact reflected in their average rating time per story increasing significantly from 69.8 seconds to 87.3 seconds (p<0.05). In contrast, the average rating time of AMT workers decreased from 135.3 seconds for human-written text (Day 1) to 91.5 seconds for GPT-2 text (p<0.05). Teachers also reported having to recalibrate their scale when rating the GPT-2 generated stories, as the stories were significantly worse than the human-written text. Consequently, they suggested that it would be easier to calibrate their scale had the GPT-2 output been presented beside the human-written text, which supports the results from our joint rating task with AMT workers. Finally, the teachers suggested that creating a standardized rubric would greatly facilitate the rating process. This step is even more important as machine-generated text exhibits different issues than human-written text.
Resolving teacher disagreement: One advantage of using human expert raters is that we can easily have them discuss examples on which they disagree. We arranged a mediation meeting between two of the three teachers to discuss 60 stories on which they showed the highest disagreement (3 attributes × 10 stories × 2 types; we excluded likability due to its subjective nature). In this meeting, they were first asked to rate the stories again, without being provided their previous rating. In about 20% of cases, one of the teachers disagreed with their own previous rating due to honest lapses of judgment. Another common reason for disagreement was missing world knowledge (see Figure 1, right). A further reason for disagreement was confusion about how to rate slang in terms of grammaticality: while the text was not correct according to formal grammar, it was appropriate for the prompt, so one teacher rated it high while the other rated it low. Overall, after discussing examples that they still disagreed on after re-rating, teachers were able to come to a consensus on 80% of the stories; the remaining disagreements persisted due to individual differences in strictness. See Appendix C for details on the mediation meeting.
Replicating the study on Upwork: We recognize that replicating our study is difficult without access to a network of English teachers. As such, we performed the same experiment using three certified teachers recruited on a freelance platform, Upwork. The teachers were paid $175 each for evaluating the same 200 human-written and 200 GPT-2 generated stories using the exact same setup as in subsection 3.1. It took approximately one week to collect the data (including a break between rating human-written and GPT-2 generated stories). The results obtained via Upwork were comparable with the results obtained from the English teachers described in this section, i.e., the Upwork teachers rated human-written stories higher for coherence, relevance, and likability than the GPT-2 generated stories (all p's<0.001). Interestingly, their IAA was higher than that of the English teachers recruited from the authors' personal networks. The details of this experiment are provided in Appendix B.

Related Work
Our work is related to previous studies of human evaluation of text quality as well as collecting judgments using Amazon Mechanical Turk.
Even professional translators struggle when evaluating longer machine translated texts (Castilho, 2021). Creative texts, such as stories, are less constrained than translated texts, but researchers continue to employ crowd workers to evaluate creative texts, often without evaluating reference texts (see Section 2). Previous studies have asked workers to choose from (Mori et al., 2019) or distinguish between human-written and machine-generated texts (Garbacea et al., 2019; Ippolito et al., 2020; Clark et al., 2021).
Data collection using AMT: Many previous works raise concerns about the reliability of data collected on AMT (Necka et al., 2016; Matherly, 2019; Ahler et al., 2020). Reluctance of requesters to reject HITs leads to positive bias in workers' qualifications (Matherly, 2019). Furthermore, a large number of responses are provided by a small number of productive workers (Fort et al., 2011; Robinson et al., 2019). Researchers also report that an increasing number of workers use VPNs to mask their location (Bauer et al., 2020) and contribute lower-quality data (Moss and Litman; Ahler et al., 2020). Hence, simple quality control measures, such as approval rate or the country of residence as suggested by Berinsky et al. (2012), may not be sufficient to effectively filter workers who are spamming a task.

Recommendations & Conclusion
Our experiments show that evaluating open-ended generated text is an incredibly challenging task even for expert raters. While AMT is a convenient and affordable solution, we observe that high variance between workers, poor calibration, and cognitively-demanding tasks can lead researchers to draw misleading scientific conclusions (e.g., that human-written text is "worse" than GPT-2's). Simple fixes such as adding strict worker qualifications do not address the root of the problem. As such, we recommend that future AMT evaluations implement additional quality control mechanisms (some of which require custom task setups on external servers) such as (1) filtering workers by observed time spent per HIT rather than WorkTimeInSeconds, (2) specifying a maximum number of items per worker, (3) employing a pre-task language proficiency test, and (4) providing training HITs to allow workers to calibrate their ratings. Furthermore, we show that researchers can improve rating calibration by presenting machine-generated text alongside human reference text. That said, expert raters such as linguists or language teachers should be used whenever possible as they have already been trained to evaluate written text, and it is not much more expensive (it cost us $144 to rate 200 stories with AMT vs. $187.50 with English teachers vs. $262.50 with Upwork).
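The cost comparison above can be reproduced from figures given elsewhere in the paper: $0.20 per single-story HIT with three judgments for each of 200 stories, $125 per teacher, and $175 per Upwork teacher, with three raters covering 400 stories in each expert setting. The sketch below is a back-of-the-envelope check; the 20% platform fee is an assumption that is not stated in the text and is included only because it reproduces the reported $144.

```python
ASSUMED_AMT_FEE_RATE = 0.20  # ASSUMPTION: platform commission, not stated in the paper


def amt_cost(stories, judgments_per_story, pay_per_hit, fee_rate):
    """Total requester cost for single-story HITs, including platform fee."""
    return stories * judgments_per_story * pay_per_hit * (1 + fee_rate)


def expert_cost_per_200(raters, pay_per_rater, stories_rated):
    """Total pay for all raters, scaled to a batch of 200 stories."""
    return raters * pay_per_rater / stories_rated * 200


amt = round(amt_cost(200, 3, 0.20, ASSUMED_AMT_FEE_RATE), 2)
teachers = round(expert_cost_per_200(3, 125, 400), 2)
upwork = round(expert_cost_per_200(3, 175, 400), 2)

print(f"AMT: ${amt:.2f}  teachers: ${teachers:.2f}  Upwork: ${upwork:.2f}")
```

Under these assumptions the three figures come out to $144.00, $187.50, and $262.50 per 200 stories, each with three ratings per story.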

Ethical Considerations
As with all research that makes use of human subjects, we must carefully reflect on our methodology to minimize the risk of harm to those we ask to evaluate open-ended texts. Specifically, texts from social media sites like Reddit may contain racist, sexist, and other forms of vulgar content. Additionally, neural language models like GPT-2, which have been trained on open domain text crawled from the web, have been shown to generate similarly offensive content. As such, we advocate adequately warning any humans who take part in open-ended text evaluation of the potential for such harms (as we did in our research).
Additionally, crowd workers are frequently underpaid for their labor, which harms both the quality of the research and, more importantly, the ability of these crowd workers to earn an adequate living. As such, we report our hourly wage for both crowd workers and experts. We ensure that crowd workers earn at least $14 per hour by assuming 50-55 seconds per HIT (though on average our crowd workers were paid substantially more due to the low average time to completion on each HIT). Our experts averaged around $20 per hour (not counting mediation).
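The wage arithmetic behind such a guarantee can be sketched as follows. The $14 hourly target and the 50-55 second assumption come from the text above; the per-HIT payment and the 30-second completion time in the second example are hypothetical values used only to show how a faster completion time raises the effective wage.

```python
SECONDS_PER_HOUR = 3600.0


def min_pay_per_hit(hourly_wage, seconds_per_hit):
    """Smallest per-HIT payment that meets the hourly wage target."""
    return hourly_wage * seconds_per_hit / SECONDS_PER_HOUR


def effective_hourly_wage(pay_per_hit, seconds_per_hit):
    """Hourly wage implied by a per-HIT payment and a completion time."""
    return pay_per_hit * SECONDS_PER_HOUR / seconds_per_hit


if __name__ == "__main__":
    # Target of $14/hour under the 50 and 55 second assumptions.
    for seconds in (50, 55):
        print(seconds, round(min_pay_per_hit(14.0, seconds), 4))

    # Hypothetical: a $0.22 HIT completed in 30 seconds.
    print(effective_hourly_wage(0.22, 30))
```

Budgeting with the slower end of the assumed range (55 seconds) gives the more conservative payment, since any worker who finishes faster earns above the target.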

B Collecting Ratings on Upwork
We also hired three teachers using the freelancing platform Upwork. The teachers were paid $175 to evaluate the same 200 human-written stories and 200 GPT-2 generated stories. They were asked to perform the ratings on the AMT platform in order to use the same interface as workers on AMT. Similarly to the teachers recruited from the authors' personal network, the teachers recruited on Upwork were asked to rate the 200 human-written stories first and then, after a few days' break, provide the ratings for the GPT-2 generated stories. Furthermore, the Upwork teachers also held TEFL, TESOL, or CELTA certificates. Table A3 shows mean ratings and agreement for the data collected on Upwork. Similarly to the results described in Section 4 and summarized in Table A5, the average scores for coherence, relevance, and likability are higher for the human-written stories than for the GPT-2 generated stories (see Table A4).
Table A4: Welch's t-test for ratings collected on Upwork (human-written stories vs. GPT-2 generated stories). Human-written stories were rated higher on coherence, relevance, and likability than GPT-2 generated stories. These results are similar to those obtained from the English teachers described in Section 4.
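For readers unfamiliar with Welch's t-test, the statistic and its degrees of freedom can be sketched with the Python standard library alone. This is an illustrative sketch, not the analysis script used for the tables; the function name `welch_t_test` and the example ratings are hypothetical.

```python
import math
from statistics import mean, variance


def welch_t_test(a, b):
    """Return (t, df) for Welch's unequal-variance t-test.

    `a` and `b` are sequences of ratings from two independent groups,
    each with at least two values and non-zero variance overall.
    """
    n_a, n_b = len(a), len(b)
    var_a, var_b = variance(a), variance(b)  # sample variances (n - 1)
    se_sq = var_a / n_a + var_b / n_b        # squared standard error
    t = (mean(a) - mean(b)) / math.sqrt(se_sq)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = se_sq ** 2 / (
        (var_a / n_a) ** 2 / (n_a - 1) + (var_b / n_b) ** 2 / (n_b - 1)
    )
    return t, df


if __name__ == "__main__":
    # Hypothetical 5-point Likert ratings for two sets of stories.
    human = [4, 5, 4, 3, 5]
    model = [2, 3, 3, 2, 4]
    t, df = welch_t_test(human, model)
    print(round(t, 4), round(df, 2))
```

Obtaining a p-value additionally requires the CDF of Student's t distribution; assuming SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` computes the same test together with its p-value.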

C Details on Post-rating Interviews
Two mediation meetings were organized over Zoom with two of the three teachers (due to availability). The teachers were asked to reevaluate 60 stories on which they showed disagreement (3 attributes × 10 stories × 2 types; likability was excluded due to its subjective nature). Each meeting took approximately 2 hours (including a short break) and was led by one of the authors. The teachers were shown one story at a time and were asked to reevaluate it on the given attribute. In about 20% of the cases, the teachers agreed with each other, suggesting that the previous disagreement was due to honest lapses of judgment. In the cases where disagreement persisted, each teacher was asked to provide a justification for their rating. Often, hearing the other party's argument enabled them to see the text from a different perspective and understand the other person's rating, and this process often resulted in them adjusting their own ratings. Common reasons for disagreement that could be resolved during the mediation meeting included: world knowledge, differences in understanding of the prompt and its relation to the text (e.g., a prompt enforcing a specific style), differences in how they treated author's comments, which were sometimes present at the beginning of the story, and rationalizing connections between the sentences.
After each batch, consisting of ratings of both human-written stories and GPT-2 generated stories, each of the three teachers took part in a short one-on-one interview (∼10 min each). They were asked the following questions:
Table A5: Welch's t-test for ratings collected in the experiment described in Section 4 (teachers' ratings). Human-written stories were rated higher for coherence, relevance, and likability than GPT-2 generated stories.
Table A6: One-way ANOVA investigating the effect of group (Day 1, Day 2, Day 3, and workers from non-English-speaking countries) on the ratings of grammar of the reference texts. Partial eta squared (ηp²) is provided for the effect size (ηp² = 0.01 indicates a small effect size; ηp² = 0.06 indicates a medium effect size; ηp² = 0.14 indicates a large effect size (Cohen, 1988)).
Table A8: One-way ANOVA investigating the effect of group (Day 1, Day 2, Day 3, and workers from non-English-speaking countries) on the ratings of coherence of the reference texts. Partial eta squared (ηp²) is provided for the effect size (ηp² = 0.01 indicates a small effect size; ηp² = 0.06 indicates a medium effect size; ηp² = 0.14 indicates a large effect size (Cohen, 1988)).
Table A10: One-way ANOVA investigating the effect of group (Day 1, Day 2, Day 3, and workers from non-English-speaking countries) on the ratings of relevance of the reference texts. Partial eta squared (ηp²) is provided for the effect size (ηp² = 0.01 indicates a small effect size; ηp² = 0.06 indicates a medium effect size; ηp² = 0.14 indicates a large effect size (Cohen, 1988)).
1.00 0.14 NNS <0.001 <0.001 <0.001
Table A11: Pairwise post hoc test with Bonferroni adjustment for the ratings of relevance between Day 1, Day 2, Day 3, and non-English-speaking countries (NNS). The numbers provided in the table are p-values for the given pairwise comparison. Ratings obtained from workers from non-English-speaking countries differ significantly from ratings obtained from workers from English-speaking countries on Day 1, Day 2, and Day 3. Furthermore, there is a significant difference between ratings collected on Day 1 and Day 2.
Table A12: One-way ANOVA investigating the effect of group (Day 1, Day 2, Day 3, and workers from non-English-speaking countries) on the ratings of likability of the reference texts. Partial eta squared (ηp²) is provided for the effect size (ηp² = 0.01 indicates a small effect size; ηp² = 0.06 indicates a medium effect size; ηp² = 0.14 indicates a large effect size (Cohen, 1988)).
Table A14: Welch's t-test on ratings collected on AMT for human-written stories (Day 1) and GPT-2 generated stories. Human-written stories were rated higher for coherence and relevance than GPT-2 generated stories (p<0.05).
Table A15: Welch's t-test on ratings collected on AMT for human-written stories (Day 2) and GPT-2 generated stories. Human-written stories were rated higher for relevance and likability than GPT-2 generated stories (p<0.05).
Table A16: Welch's t-test for ratings collected on AMT for human-written stories (Day 3) and GPT-2 generated stories. Human-written stories were rated higher for coherence than GPT-2 generated stories (p<0.05).
Table A17: Welch's t-test for ratings collected on AMT for human-written stories (non-English-speaking countries) and GPT-2 generated stories. GPT-2 generated stories were rated higher for grammar, coherence, and relevance than human-written stories (p<0.05).
Table A18: Welch's t-test for ratings collected on AMT for human-written stories and GPT-2 generated stories (both stories shown in one HIT). GPT-2 generated stories were rated lower for coherence, relevance, and likability than human-written stories (p<0.05), which is in line with the ratings provided by English teachers.
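The effect size and adjustment reported in the ANOVA captions above can be sketched as follows. In a one-way design, partial eta squared reduces to the between-group sum of squares divided by the sum of the between-group and within-group sums of squares, and the Bonferroni adjustment multiplies each p-value by the number of comparisons (capped at 1). The function names and example data are hypothetical and are not taken from our analysis.

```python
from statistics import mean


def one_way_partial_eta_squared(groups):
    """Partial eta squared for a one-way design.

    `groups` is a list of lists of ratings, one inner list per group
    (for example Day 1, Day 2, Day 3, and NNS).
    """
    all_values = [x for group in groups for x in group]
    grand_mean = mean(all_values)
    ss_between = sum(
        len(group) * (mean(group) - grand_mean) ** 2 for group in groups
    )
    ss_within = sum(
        (x - mean(group)) ** 2 for group in groups for x in group
    )
    return ss_between / (ss_between + ss_within)


def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


if __name__ == "__main__":
    # Hypothetical ratings for three groups.
    demo_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
    print(one_way_partial_eta_squared(demo_groups))

    # Hypothetical unadjusted p-values for three pairwise comparisons.
    print(bonferroni_adjust([0.01, 0.04, 0.5]))
```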