Deconstructing NLG Evaluation: Evaluation Practices, Assumptions, and Their Implications

There are many ways to express similar things in text, which makes evaluating natural language generation (NLG) systems difficult. Compounding this difficulty is the need to assess varying quality criteria depending on the deployment setting. While the landscape of NLG evaluation has been well-mapped, practitioners' goals, assumptions, and constraints -- which inform decisions about what, when, and how to evaluate -- are often partially or implicitly stated, or not stated at all. Combining a formative semi-structured interview study of NLG practitioners (N=18) with a survey study of a broader sample of practitioners (N=61), we surface goals, community practices, assumptions, and constraints that shape NLG evaluations, examining their implications and how they embody ethical considerations.


Introduction
Evaluating natural language generation (NLG) models and systems -- which generate new or altered text -- is notoriously difficult, as there are often many valid ways to express similar content in text. This difficulty is often compounded by the need for NLG systems to meet a variety of goals -- often measured in competing and imperfect ways -- depending on their deployment settings (Sai et al., 2022). These challenges also mean that practitioners seeking to evaluate their NLG systems must decide on their goals for the system (e.g., help users write fluently), what quality criteria to use to reflect those goals (e.g., output fluency), and how to operationalize the measurement of those criteria (e.g., perplexity or crowd judgments). Such decisions are often guided by deployment settings, by evaluation norms in a research community (Gkatzia and Mahamood, 2015), by assumptions about "proper" NLG model or system behavior, and by real-world constraints (Sai et al., 2022; Gehrmann et al., 2021). Although the landscape of quality criteria and operationalizations has been surveyed, much less is known about how practitioners' goals, assumptions, and constraints shape their decisions about which criteria to use, when, how, and why. Limited visibility into how these factors shape NLG evaluations can, however, make it difficult to anticipate the issues they might give rise to, or even to understand what we can learn about research progress.
In this paper, we surface goals, assumptions, community practices, and constraints that NLG practitioners work with when evaluating NLG systems, including limitations and ethical implications. These tacit elements of NLG evaluation are challenging to ascertain from literature surveys, as they are often unstated or only partially stated.
To reveal otherwise latent aspects of NLG evaluation practice, we apply a mixed-methods approach from human-computer interaction (HCI). Specifically, we conduct a formative semi-structured interview study (N=18) and a survey study (N=61) of NLG practitioners ( §2) in order to identify common themes related to practitioners' goals ( §3.1), practices ( §3.2), and assumptions and constraints ( §3.3) when evaluating NLG systems, including ethical considerations ( §3.4). By recognizing more tacit elements of NLG evaluation, our work aims to provide scaffolding for discerning the issues they may give rise to and help re-think NLG evaluations.

Background and Methods
To examine NLG evaluation practices -- along with the goals, assumptions, and constraints that inform them -- and whether they incorporate assessments of possible adverse impacts, we first conducted a formative semi-structured interview study of NLG practitioners ( §2.1). To probe the themes emerging from our interviews at greater scale, we subsequently conducted a survey study with a broader sample of practitioners ( §2.2).

Participant recruitment and IRB. We recruited NLG practitioners using snowball sampling (Parker et al., 2019), via targeted emails and social media posts, with a request to share with relevant groups. Each interviewee received a $25 gift card for a 45-minute video-conference interview, while survey participants could enter a raffle for one of ten $50 gift cards. Both studies were IRB approved, and we obtained informed consent from all participants.

Interview Study
To design our study, we drew on the measurement modeling framework (Jacobs and Wallach, 2021) to tease apart practitioners' conceptualization of quality criteria from how they measure those criteria. We also investigated how current NLG evaluation practice grapples with possible adverse impacts, such as fairness and inclusion issues (Sheng et al., 2021;Weidinger et al., 2021).
We began our interviews with a) background questions on participants' NLG projects and experience to establish context. Subsequent questions broadly asked about b) practitioners' overall project goals, c) the quality criteria they want to assess their system on, and d) how they measure those criteria-with c) and d) following the measurement modeling framework. We also asked participants about adverse impacts of their work, and whether they measure or assess them.
Before conducting the final interviews, we first ran pilot interviews (N=5) with practitioners from both product and research organizations to identify possible clarity issues with our questions and study protocol. The full interview protocol is available in Appendix A. Overall, we interviewed 18 practitioners from 12 organizations (Table 1).
Identifying interview themes. To analyze interview transcripts, we used a bottom-up approach rooted in grounded theory (Strauss and Corbin, 1997; Charmaz, 2014), following Robertson et al. (2021). We iteratively coded and thematically sorted interview excerpts by looking for relations with or among already assigned codes. Specifically, we first distributed the interview transcripts across all authors for open coding, with the resulting codes being then discussed by all authors to identify and agree upon clusters of common themes.
To organize codes into themes, we drew again on measurement modeling to distinguish between codes for what practitioners want to measure (quality criteria) and codes for how practitioners operationalize those measurements (evaluation practices). To examine which factors affect practitioners' decisions about what criteria they prioritize and operationalize, we also consider the goals, assumptions and constraints that underpin their evaluation practices, clustering codes into related themes. Finally, we identify themes on how practitioners conceptualize and embody possible adverse impacts.

Table 1: Interview and survey participants by sector.
Survey Study
We designed our survey around themes from the interview study, starting again with questions about participants' background, and organizing the remaining questions under a few broad themes -- a) quality criteria and goals, b) evaluation practices, c) evaluation rationales (including assumptions and constraints), and d) adverse impacts and ethical considerations -- with different sets of questions matching different themes. As with the interview study, before sharing and advertising our survey, we piloted it with a few industry researchers (N=3) to identify possible clarity issues with our survey questions and protocol. See Appendix B for the full survey script. To filter out spammers, we relied on three of the open-text questions: the description of participants' occupation, the description of their NLG project, and their reflections on how to improve NLG evaluation. After removing spammers, we analyzed responses from 61 participants (Table 1).
Across both studies, we observe common themes in participants' responses that we overview below. Interview participant quotes-anonymized and paraphrased for brevity and clarity-are followed by "P" and a participant ID, while quotes from survey responses are only marked by "SP." Statistics based on survey responses are followed by "SQ" and the question number. We report disaggregated statistics as (X% academic, Y% non-academic) when gaps exist between academic and non-academic participants. We note, however, that the statistics we report are only indicative as our respondent sample is skewed towards industry and academic researchers (Table 1), and not necessarily representative of the broader NLG community.
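As an illustration of this reporting convention, the following minimal sketch (in Python, with hypothetical column names and toy data rather than our actual analysis code) computes such overall and sector-disaggregated percentages from tabular survey responses:

import pandas as pd

# Toy survey table: one row per participant (hypothetical columns).
responses = pd.DataFrame({
    "sector": ["academic", "non-academic", "academic", "non-academic"],
    "goal_report_in_paper": [True, True, False, True],  # e.g., an SQ13-style checkbox
})

# Share of participants selecting the goal, overall and disaggregated by sector.
overall_pct = responses["goal_report_in_paper"].mean() * 100
by_sector_pct = responses.groupby("sector")["goal_report_in_paper"].mean() * 100

print(f"overall: {overall_pct:.0f}%")
print(by_sector_pct.round(0).to_string())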

Evaluation Goals and Quality Criteria
What practitioners want their NLG systems or models to do (goals) and what they aim to assess to determine whether those goals are met (quality criteria) both shape which evaluation practices they are likely to follow. Participants' reflections on what success looks like for their projects and which quality criteria they deem critical could thus help discern the implications of the resulting practices. While practitioners conduct NLG evaluations to report to a variety of audiences, paper reviewers and the research community are most often mentioned. While we see differences between academic and non-academic participants -- with the latter thinking more about specific audiences like their team or manager (19% academic, 53% non-academic) or other teams (16% academic, 50% non-academic) -- 77% of survey participants say one goal of their evaluation was to report results in a paper (90% academic, 63% non-academic) and 59% to report to a research community (67% academic, 50% non-academic) [SQ13]. If reporting results in papers or to research communities is a significant goal for many practitioners, however, then evaluations already legible to a research community may be over-incentivized, and quality criteria and measurements unfamiliar or uncommon to that community may be discouraged.
Quality criteria like correctness, usability, coherence, and fluency are often prioritized. While our participants mention a diversity of criteria, 77% of survey respondents deemed correctness among the most important criteria (83% academic, 70% non-academic), followed by 49% usability (38% academic, 60% non-academic), 47% coherence, and 44% fluency; criteria like readability (21%) and clarity (21%) were least frequently selected [SQ14]. Without clarity about what these criteria mean for a given project and how they relate to its goals, we may focus too strongly on a metric -- like perplexity -- and risk losing the ability to assess the metric's appropriateness and to distinguish between the metric and the original goal.
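To make this gap concrete, here is a minimal sketch (ours, not drawn from any participant's setup) that operationalizes a fluency-like criterion as perplexity under an off-the-shelf GPT-2 model via Hugging Face transformers; a low score reflects the scoring model's likelihoods, not necessarily the original goal (e.g., helping users write fluently):

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under GPT-2; lower is often read as 'more fluent'."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # With labels == input_ids, the returned loss is the mean token-level
        # negative log-likelihood; exponentiating it gives perplexity.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

print(perplexity("The generated summary reads smoothly."))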

Evaluation Practices
Existing evaluation practices can also provide insight into what is deemed important to evaluate, what is overlooked during evaluation, and how quality criteria are being operationalized.
Although routine, the use of automatic metrics may lack justification and validation, as "[e]veryone is in this in-between phase of just using everything they can find."

Human evaluations may draw on usage or feedback data in deployed settings, but many practitioners lack access to either deployment settings or usage data. 30% of survey participants report conducting evaluations with users in deployed settings, with a sizeable gap between academics (12%) and non-academics (46%) [SQ18]. Interview participants also noted "this gap between academic [evaluations and the] things that matter when you actually deploy something out in the world" [P15]. Since feedback from users of a deployed system might more accurately reflect their in-situ needs than feedback from, e.g., crowdworkers, practitioners without access to deployment settings may miss critical evaluation data and have little visibility into possible adverse impacts.
Practitioners manually assess generated texts, but rarely follow a formal method when doing so. For some participants, manual assessment is more of an art than a science, and many follow neither an established method nor a self-devised protocol when doing so: "it just comes down to me reading a lot of samples and then choosing the one which overall seems to be better" [P5]. Indeed, while 77% of survey participants report always or often manually examining generated texts, of those who perform such qualitative examinations, only 38% report always or often applying a formal method (30% academic, 47% non-academic) [SQ18], and even fewer describe such a method, with a few using a labeling scheme or a sampling protocol for selecting which examples to assess [SQ19]. While common, the practice of (informally and) manually examining generated text to assess quality is rarely detailed in the literature. It can be problematic if results of error analyses (Van Miltenburg et al., 2021) are shown without describing how assessments were made. Practitioners also may not be representative of those who use NLG systems or are reflected in generated text (Hovy and Spruit, 2016).

Practitioners often know little about those represented in or contributing to data creation and annotation, or to crowdsourced evaluation. Only 18% of survey participants report always or often having demographic information about the annotators or authors who created the datasets they use [SQ18]. Of those who report using crowdsourcing, only 30% report always or often having demographic information about crowdsourcing judges [SQ18]. Such information can provide crucial insight into the speakers, communities, language varieties, and perspectives represented in and shaping datasets; without it, some datasets -- whose text may read to some as if "it's written by [a] small group of people" [P17] -- risk being treated as representative while excluding wide swaths of people.
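As an illustration of the kind of formal method a few participants describe for [SQ19] -- a sampling protocol plus a labeling scheme -- the following hypothetical sketch draws a fixed-size random sample of outputs and writes a sheet for labeling them against a small predefined coding scheme, rather than ad-hoc "eyeballing":

import csv
import random

LABELS = ["fluent", "faithful", "off-topic", "other-error"]  # example coding scheme

def sample_for_review(outputs: list[str], n: int = 50, seed: int = 0) -> list[str]:
    """Draw a reproducible random sample of outputs for manual inspection."""
    random.seed(seed)
    return random.sample(outputs, min(n, len(outputs)))

def write_labeling_sheet(sampled: list[str], path: str = "manual_review.csv") -> None:
    """Write one row per sampled output, with one column per label for the reviewer."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["output"] + LABELS)
        for text in sampled:
            writer.writerow([text] + [""] * len(LABELS))

write_labeling_sheet(sample_for_review(["output 1", "output 2", "output 3"]))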
Many practitioners re-purpose datasets created for other tasks, with 59% of survey participants reporting always or often doing so [SQ18], while only 48% report always or often using datasets created in-house for their project (42% academic, 53% non-academic) [SQ18]. Datasets' suitability should, however, be carefully interrogated before they are re-purposed. If datasets custom-built or validated for other tasks do not properly capture the new task's peculiarities or the desired quality criteria, they can pose threats to the validity of the task and its evaluation.

Rationales for Evaluation Practices
To further probe participants about assumptions, constraints and other considerations that shape their evaluation practices, we also explicitly asked them about what guided their NLG evaluations.
While many practitioners recognize automatic metrics' limitations, these metrics are often assumed to be reliable enough. A participant noted they "wouldn't trust any sort of automatic measure of a text generation system [as they need] more than just a good BLEU or ROUGE score before [they'd] sign off on using a language model" [P11], while others questioned whether automatic metrics "capture anything meaningful" [P13] when assessing latent constructs like creativity. Despite these and other documented shortcomings (Gkatzia and Mahamood, 2015; Novikova et al., 2017; Kuribayashi et al., 2021; Liang and Li, 2021), practitioners do rely broadly on automatic metrics: 50% of survey participants agree or strongly agree that automatic metrics are reliable ways to assess NLG systems or models [SQ20], while 43% say that metrics developed for one NLG task can be reliably used or adapted to evaluate other NLG tasks (32% academic, 53% non-academic) [SQ22]. One participant remarked that "automatic metric[s are] still more scalable and objective than human evaluation" [SP]. This trust in automatic metrics' reliability across tasks may entrench their use, possibly crowding out other measurement approaches, especially as publication processes may reward metrics that allow for easy comparison to existing models.
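For readers less familiar with these metrics, the following minimal sketch (with toy data; the library choices are ours, not necessarily the participants') reports corpus BLEU via sacrebleu and ROUGE-L via the rouge-score package. Computing the numbers is easy, which is part of the appeal, but neither score by itself validates that the metric reflects the quality criterion of interest:

import sacrebleu
from rouge_score import rouge_scorer

# Toy system outputs and references (one reference per output).
hypotheses = ["the cat sat on the mat"]
references = ["a cat was sitting on the mat"]

# Corpus-level BLEU; sacrebleu expects a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])

# Sentence-level ROUGE-L F1 for the single example.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(references[0], hypotheses[0])["rougeL"].fmeasure

print(f"BLEU: {bleu.score:.1f}, ROUGE-L F1: {rouge_l:.2f}")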
When it comes to automatic metrics, many practitioners believe more is better. A participant noted "the current status is you kind of just throw it all at the wall. So if there's another automatic metric, great, let's add that to the suite of metrics we're running" [P13], and another that they "ended up having a lot of metrics because none of them is a superset of any others. They all have issues" [P17]. 54% of survey participants also agree or strongly agree it is a good practice to evaluate with as many automatic metrics as possible [SQ22]. Yet a "kitchen sink" approach to evaluation risks obfuscating quality with quantity, yielding apparent model improvements that may not correlate with actual performance, or creating expectations of improvements on certain metrics even when those are not necessary or desirable. This approach may also exist partly to pay down "tech debt," where metrics are included mainly to enable comparisons to legacy baselines assessed with them.
Human evaluations are often assumed to be the gold standard, but the term refers to many things. A participant said they "always trust the human evaluations, that's the gold standard. If the human evaluation is not great, then we would not consider that a good solution" [P9]. This was echoed by survey participants, with 69% believing automatic metrics should always or often be judged by how they correlate with human evaluations (74% academic, 63% non-academic) [SQ23]. Most participants assume human evaluations can be reliably used for NLG evaluation, with 56% agreeing or strongly agreeing that this is the case for crowdsourcing, 71% for manual inspection by practitioners themselves, and 89% for metrics based on usage patterns in deployment settings [SQ20]. Given these variations in practitioners' trust in different types of human evaluations, the blanket use of the term "human evaluation" to refer to them collectively can make it hard to calibrate claims about human evaluation results.

Expert judges, in particular, are often assumed to add something that standard crowdsourcing cannot: in our survey, 87% of participants believe experts always or often contribute to the evaluation in ways standard crowdsourcing cannot [SQ23]. Experts may, however, be even less representative of those using an NLG system and may inject their shared biases into evaluations (van der Lee et al., 2019).
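The meta-evaluation many participants point to -- judging an automatic metric by how its scores correlate with human evaluations collected on the same outputs -- can be sketched as follows (made-up numbers; the correlation statistics shown are a common choice, not necessarily what any participant uses):

from scipy.stats import pearsonr, spearmanr

# Hypothetical per-output scores from an automatic metric and mean human ratings
# collected on the same outputs.
metric_scores = [0.31, 0.55, 0.42, 0.78, 0.64]
human_ratings = [2.0, 3.5, 2.5, 4.5, 4.0]

r, r_p = pearsonr(metric_scores, human_ratings)
rho, rho_p = spearmanr(metric_scores, human_ratings)
print(f"Pearson r={r:.2f} (p={r_p:.3f}), Spearman rho={rho:.2f} (p={rho_p:.3f})")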
Practitioners say NLG evaluations are most constrained by datasets, annotators, lack of metrics, and inability to deploy. When asked to rank resources by how much they constrain their evaluations, 59% of survey participants ranked relevant datasets among their top 3 most limiting resources (48% academic, 70% non-academic), 50% ranked expert annotators, 42% the ability to deploy, 37% crowdsourcing costs (45% academic, 30% non-academic), 36% relevant metrics; with compute and engineering resources being least frequently ranked among top limiting resources.
Existing practices, metrics, and data shape the evaluation choices of many practitioners. Participants noted that "academic benchmarks are very important [as] they're [an] easy measure to figure out whether your experiments are working" [P15] or that "we used all the basic standard automatic [metrics in] that suit" [P2]. When asked to rank the considerations that guide their evaluations, 55% of survey participants ranked existing standards and practices among their top 3 (67% academic, 43% non-academic), 36% ranked dataset and metrics availability (45% academic, 26% non-academic), and 29% maintaining performance in deployed settings (12% academic, 46% non-academic) [SQ26]. This emphasis on existing practices, metrics, and data may be due to the challenges of developing new ones, to the expectations of audiences like a research community or a team, or to the desire for easy comparison with prior work. This reliance also makes their proper development and validation even more critical.
Poor performance is most often attributed to poor data quality, with some practitioners also believing "one [way] of driving quality upwards is by feeding in more data to these models" [P12]. Others echoed this belief, explaining how their "model tends to talk about things similar to its training data; whatever you feed into the model, that's what the model [will] output" [P7] as "language models are kind of bottom up, you have a pile of historical data and you learn the patterns of language used in that data" [P11]. In our survey, 62% of practitioners agree or strongly agree that output quality is primarily impacted by the quality of the training data (55% academic, 70% non-academic) [SQ22]. Practitioners may thus rely mostly on expanding or curating datasets to improve performance, overlooking other approaches (e.g., optimization methods, UX design) or offloading issues onto dataset builders (Sambasivan et al., 2021).
Practitioners believe datasets should be formally evaluated, but that rarely happens. 71% of survey participants say there is always or often a need to evaluate the quality of training datasets (77% academic, 63% non-academic), and 87% say so for evaluation datasets (97% academic, 77% non-academic) [SQ23]. An interview participant, however, pointed to the "concern of who is in charge of validating all these datasets. I'm sure some slightly harmful things can fall through the cracks, that no one takes a look at just due to the sheer size of these datasets" [P3]. Datasets that are not carefully evaluated may not be well-matched to what they aim to measure (Blodgett et al., 2021) or may give rise to harms (e.g., Prabhu and Birhane, 2021).

Ethical Considerations and Impacts
Finally, we solicited practitioners' reflections on ethical issues or adverse impacts stemming from their projects, and whether they try to measure or mitigate such issues. We use the term fairness and inclusion (F&I) to refer to a broad range of adverse impacts, like allocational and representational harms.

Practitioners hold a range of beliefs about what properties the language in their datasets or generated by NLG systems should embody. 52% of survey participants agree or strongly agree that generated text should adhere to standard grammar rules of the language it is generated in [SQ33], and 57% say neutrality is a property NLG systems should embody [SQ37]. Many also want their system or model to capture either a "neutral voice" (30%), "no voice" (8%), or not to capture any particular writing or speaking voice (33% overall; 45% academic, 20% non-academic) [SQ35]. By contrast, only 18% believe the datasets they use do not capture any voice, and 18% say they capture a neutral voice [SQ34], while 21% also believe their systems or models do not capture any voice, and 16% say they capture a neutral voice (10% academic, 23% non-academic) [SQ36]. Notably, while 13% of non-academic participants think their system captures a specific, non-neutral voice, none of the academic participants do; this is also reflected in the interviews, where multiple industry participants aim to capture a specific type of voice, like matching a "company voice" [P6]. Such beliefs about language are consistent with pervasive language ideologies -- "the cultural system of ideas about social and linguistic relationships" (Irvine, 1989) -- including the idea that some kinds of language (and their speakers) are inherently correct, neutral, or appropriate for public use (Rosa and Burdick, 2017), and thus that generated language can be neutral or voiceless. Since assumptions about language are rarely named or interrogated, it remains unclear how they shape practitioners' evaluations. But they can give rise to harm; for example, a belief that race-related language is inherently inappropriate if generated (Schlesinger et al., 2018) can erase language used by, or topics important to, minoritized users.
The topics and linguistic style of generated text are rarely systematically measured. When asked if they measured the linguistic style of generated text, a participant, under the assumption that "these models [just] follow the training data," noted they "don't necessarily analyze the generated output" [P9]. Only 25% of survey participants report systematically measuring what topics their system tends to generate, while 18% report systematically measuring the linguistic style of generated text (10% academic, 27% non-academic) [SQ31] (notably, though, 43% report informally assessing style or tone, e.g., via eyeballing). Coupled with assumptions about neutral or voiceless language, this suggests that the over- or under-representation of topics, perspectives, or language varieties may not generally be conceptualized as F&I issues or measured. NLG systems may, however, unintentionally prioritize "voices" over-represented in datasets, especially when those voices are assumed neutral and are not examined.
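Even crude, systematic measurements can be a starting point here. The sketch below is an illustrative, hypothetical example (the keyword lists and the lexical-diversity proxy are placeholders, not validated measures) of tracking what a batch of generated outputs tends to talk about, plus one coarse style signal:

from collections import Counter

# Hypothetical, hand-picked keyword lists per topic of interest.
TOPIC_LEXICONS = {
    "sports": {"game", "team", "score"},
    "finance": {"market", "stock", "price"},
}

def topic_counts(texts: list[str]) -> Counter:
    """Count how many outputs mention at least one keyword from each topic."""
    counts = Counter()
    for text in texts:
        tokens = set(text.lower().split())
        for topic, keywords in TOPIC_LEXICONS.items():
            if tokens & keywords:
                counts[topic] += 1
    return counts

def type_token_ratio(texts: list[str]) -> float:
    """A rough lexical-diversity proxy for style, not a validated style measure."""
    tokens = [tok for text in texts for tok in text.lower().split()]
    return len(set(tokens)) / max(len(tokens), 1)

outputs = ["The team won the game.", "Stock prices fell sharply."]
print(topic_counts(outputs), f"TTR={type_token_ratio(outputs):.2f}")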
To mitigate F&I issues, practitioners seek control over what is generated, as many do not want "an uncontrolled model to spew out random information."

Blocklists are often the go-to mitigation technique for F&I issues. A majority of interview participants report using blocklists as an F&I mitigation approach -- including to control system outputs -- and 21% of survey participants report always or often using blocklists (7% academic, 37% non-academic) [SQ30], with notably more use in non-academic settings. However, as one participant noted, "we have blocklists and phrasal blocklists [to help] prevent some of the major categories of failure, and they're understandable and explainable and adjustable, which is the upside. But is the coverage good? I don't think it is" [P12].
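For concreteness, a minimal sketch of the kind of blocklist-style filtering participants describe is shown below (the entries and function are hypothetical placeholders; as P12 notes, the coverage of such lists is limited):

import re

BLOCKLIST = ["badword", "another banned phrase"]  # hypothetical placeholder entries
BLOCK_RE = re.compile("|".join(re.escape(term) for term in BLOCKLIST), re.IGNORECASE)

def passes_blocklist(generated: str) -> bool:
    """Return True only if no blocklisted term or phrase appears in the output."""
    return BLOCK_RE.search(generated) is None

print(passes_blocklist("This output mentions another banned phrase."))  # False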

Conclusion
We identify practices and assumptions that shape the evaluations of NLG systems, which we argue deserve further attention due to their potential adverse impacts, particularly when those practices and assumptions are only partly stated, or not stated at all. Our findings suggest a number of approaches and future directions, including some raised by survey participants when asked how to improve NLG evaluation [SQ38].
First, the range of implicit assumptions and choices we uncover indicates that future evaluations might be strengthened by making these assumptions and choices explicit, including how quality criteria are conceptualized, how manual analysis is conducted and shapes decisions, and whether some evaluation approaches gate others (as automatic evaluations often gate human ones).
Participants also suggested several opportunities for evaluation; one respondent raised the possibility of lab studies to explore people's perceptions and uncover failure modes without fully deploying a system, while another emphasized the need to study systems in interaction given the importance of social and task context. Industry participants described a range of practices, including user engagement metrics, red teaming, and blocklists, which may be less well-studied in academia.
The challenges surrounding the use of automatic metrics-particularly the pressure towards a "kitchen sink" approach despite widespread recognition of many existing metrics' limitations-may not have an easy remedy. A "kitchen sink" approach should be thoroughly justified, and not used only because practitioners lack clarity about their goals and quality criteria. Beyond clarity about goals and quality criteria, a better understanding of evaluation datasets and metrics-whose perspectives and language are contained in datasets, what datasets and metrics can and cannot capture, and how metrics are connected to user or downstream outcomes-may also clarify their appropriateness for different settings, and disincentivize simply evaluating with as many of them as possible.
Furthermore, human evaluations, particularly those grounded in specific deployment settings or involving extensive participation, tend to be costly. Not only is such work expensive, but it may also involve qualitative or participatory approaches less familiar to NLP practitioners (than using automatic metrics), or may slow down product releases or publications. Industry teams, often constrained by tight deadlines, may be especially unlikely to adopt qualitative or participatory approaches in the absence of readily available, well-validated methods and guidance for their use. While such approaches will never be as inexpensive as many automatic metrics, user studies or design workshops with users to help develop and solidify methods and guidance might lower barriers to their broader adoption. Finally, shifting publication incentives towards work grounded in deployment scenarios, as well as work engaging with methods from other disciplines (e.g., HCI or sociolinguistics), may encourage the investment of time and resources needed to carry out more thorough evaluations.
Turning to ethical considerations, our participants often raised the need for investment in the development of resources for conceptualizing and measuring F&I issues. These include better frameworks of harms (to help anticipate issues beyond toxicity), appropriate datasets, additional frameworks to guide (and scale) qualitative evaluations, task-specific metrics to combat automatic metric reuse without proper validation, and approaches for measuring latent qualities such as voice, style, and topic (which remain longstanding open questions). Other opportunities include broader disciplinary shifts. Several participants desired clearer community standards addressing not only best practices in modeling, but also meta-level questions-e.g., who makes decisions on what ethical considerations to prioritize (for evaluation and beyond), or how to develop repeatable mechanisms for surfacing, reporting, and addressing failures as they arise.
Our interview and survey studies -- which are intended to uncover community assumptions, constraints, and practices -- necessarily surface those of some but not all practitioners. Although we aimed to recruit as widely as possible, our participants were recruited via snowball sampling, seeded with researchers and industry groups that were well-known to us, as well as on social media (and therefore biased by our followers). Additionally, our interviews and surveys were conducted only in English. Those who participated largely focus on research, work on English-language NLG systems, and live and work in the Global North. While this may reflect the NLG research community and who has been able to participate in it, our findings may not reflect the assumptions or practices of practitioners working in other settings.
The results of our studies are also largely descriptive, and while we believe they are informative, they cannot provide an exhaustive set of assumptions, practices, or F&I concerns, and should be interpreted accordingly. While our goal is to encourage further reflection about practices, the factors that shape them, and their possible implications, we may instead risk inadvertently discouraging certain evaluation practices.

[SQ17] Is your goal or one of your future goals to personalize the voice of your system for users? By voice we mean that it may capture a particular style of speaking or writing, or a particular persona.
• Yes, no, maybe

B.4 Evaluation practices: what do you do during evaluation?
[SQ18] For the following statements, please answer how often you or others on your team do the following. Never, Rarely, Sometimes, Often, Always, (N/A)
• Use automatic evaluation metrics
• Use automated metrics originally developed for other NLG tasks (off the shelf)
• Use online crowdsourcing for model/system evaluation (e.g., Mechanical Turk)
• If used, online crowdsourcing evaluation is contingent on your system or model achieving a certain performance level on some automatic metrics
• During development, you or others on your team manually examine or evaluate the outputs of your NLG system or model
• Follow a formal method or procedure when you or your team members are manually inspecting/evaluating the system/model output (beyond just "eyeballing"; e.g., applying a coding scheme)
• Evaluate with users in actual deployed settings (users who use the system for their own reasons, self-incentivized, and are not paid as test subjects/crowd workers)
• Seek out judges or annotators with certain specific expertise to assist with your system or model evaluation (e.g., linguistics)
• Use datasets that were made in-house specifically for your project
• Re-use datasets (as-is or modified) that were created for other tasks or work (typically by others outside your team)
• Have demographic information about the judges, annotators, or writers who created the datasets or reference solutions you use
• Have demographic information about online crowdsourcing judges or annotators who evaluated the outputs of your system/task

[SQ19] If you follow a formal method or procedure for your manual evaluation or inspection of the system/model output, could you please describe your method? R: Write-in

B.5 Evaluation rationales: why do you follow certain evaluation practices?
[SQ20] For the system or task you described earlier, do you agree or disagree that the following evaluation methods are reliable ways to assess performance? Strongly agree, agree, neither agree nor disagree, disagree, strongly disagree
• Using standard automatic metrics (e.g., BLEU, ROUGE, METEOR, NIST, perplexity)
• Researchers and practitioners themselves manually examining the system/model output
• Using online crowdsourced evaluations (e.g., Mechanical Turk)
• Metrics based on how the system or model is being used in actual deployed settings (by users who use the system for their own reasons, self-incentivized, and are not paid as test subjects/crowd workers)

[SQ21] For the system or task you described earlier, other than those listed above, are there other evaluation methods you believe are (or would be) reliable ways to assess the performance of your system or on your task? R: Write-in

[SQ22] In general, please answer if you agree or disagree with the following statements about evaluating NLG systems or tasks: Strongly agree, agree, neither agree nor disagree, disagree, strongly disagree
• Metrics developed for one NLG task or system can be reliably used or adapted to evaluate other NLG tasks or systems.
• It is a good practice to evaluate NLG systems or models on as many automatic metrics as possible.
• The quality of NLG systems or models' output is primarily impacted by the quality of the training data.
• Online crowdsourced evaluations are a good proxy for evaluations with users.

B.9 Final thoughts
[SQ38] How do you think NLG evaluation could be improved? R: Write-in

[SQ39] How would you characterize your experience with practices and metrics aimed at ethical AI? This does not have to be related to your NLG work.
• No, I have no experience; I have a bit of experience; I have worked on ethical AI related issues; Ethical AI is the focus of my work

[SQ40] If you would like to enter the raffle drawing for one of the ten $50 Amazon gift cards: for anonymity purposes, after submitting this form you will be provided with a link to another form to fill in your email address and enter the raffle. For this, please also write down a key phrase here, which you will be asked to re-enter on the raffle form. We will only use this key phrase to validate that raffle participants have completed the survey. Please don't use a key phrase that is associated with any accounts. R: Write-in
Figure 4: Question 16: Is your work (or will your work be) part of an NLG system that is deployed to users in real settings (users who use the system for their own reasons, self-incentivized, and are not paid as test subjects/crowd workers)?

Figure 7: Question 20: For the system or task you described earlier, do you agree or disagree that the following evaluation methods are reliable ways to assess performance?

Figure 12: Question 30: Do you intentionally try to block certain types of content from your generation? (e.g., by using blocklists or classifiers, cleaning the training data, etc.)