Modeling Disclosive Transparency in NLP Application Descriptions

Broader disclosive transparency (truth and clarity in communication regarding the function of AI systems) is widely considered desirable. Unfortunately, it is a nebulous concept, difficult to both define and quantify. This is problematic, as previous work has demonstrated possible trade-offs and negative consequences to disclosive transparency, such as a confusion effect, where “too much information” clouds a reader’s understanding of what a system description means. Disclosive transparency’s subjective nature has rendered deep study into these problems and their remedies difficult. To improve this state of affairs, we introduce neural language model-based probabilistic metrics to directly model disclosive transparency, and demonstrate that they correlate with user and expert opinions of system transparency, making them a valid objective proxy. Finally, we demonstrate the use of these metrics in a pilot study quantifying the relationships between transparency, confusion, and user perceptions in a corpus of real NLP system descriptions.


Introduction
Among the draft Ethics Guidelines for Trustworthy AI released by the European Union in 2019 were calls for greater transparency around deployed systems using artificial intelligence, advising that an "AI system's capabilities and limitations should be communicated to practitioners or end-users in a manner appropriate to the use case at hand" (High-Level Expert Group on AI, 2019). This is a high-profile example of the vagueness endemic to guidance on ethical disclosure and AI: what constitutes an appropriate manner of communication?
There is a growing awareness of the importance of communicating AI system function and performance clearly and understandably. This is a matter of public interest, for mitigating harms caused by, for example, racial biases in the performance of human classifiers (Raji and Buolamwini, 2019), but it is also in the interest of system providers: users who feel they do not understand how machine learning models work and what kind of information they rely on tend to be more resistant to using them (Poursabzi-Sangdeh et al., 2018).
However, while fairness (Dwork et al., 2018; Harrison et al., 2020) and privacy (Ji et al., 2014; Papernot et al., 2016) have been well-studied as general principles in the context of AI, transparency does not receive quite the same amount of attention. A core problem with discussing transparency is that the term is overloaded (Lipton and Steinhardt, 2019), with many different definitions in the literature (Felzmann et al., 2019). It has a broad meaning in the public consciousness, from which some studies adopt a "you'll know it when you see it" definition (Doshi-Velez and Kim, 2017). As opposed to the notions of transparency as explainability (Veale et al., 2018) and as invisibility (Hamilton et al., 2014), we focus on transparency as disclosure (Suzor et al., 2019): the extent to which the producers of an AI system or service provide detailed messaging about the system to stakeholders such as buyers, users, or the general public. This third sense, disclosive transparency, is particularly subjective. Pieters (2011) identifies a confusion dynamic in which providing "too much information" about a decision system worsens user understanding and trust. This confusion problem has serious ramifications as AI systems grow ever more connected in our lives: if disclosive transparency decreases user trust, stakeholders are further incentivized not to disclose. Unfortunately, studies of this relationship are small-scale and subjective, which is why measurable, repeatable, and objective measures of disclosive transparency are sorely needed. In this work, we decompose disclosive transparency into two components, replicability and style-appropriateness, develop three objective neural language model measures of them, and apply them in a pilot study of real systems.

Our goal is to develop an objective measure of disclosive transparency that, while being repeatable and explainable, is also intuitive and well-correlated with subjective opinions. However, at face value there is not an obvious way to assign a numerical value to the "level of transparency" in a description. To achieve this, we decompose disclosive transparency into two components: replicability, or the degree to which the requisite content to reproduce the system being described is present in a description, and style-appropriateness, or the degree to which it is written in a manner understandable to a member of the general public. However, these components are still very subjective as stated.
The key insight that underpins the possibility of developing objective measures of replicability and style-appropriateness is this: disclosure is a communication task. In describing a system, an explainer, typically an authority who is providing an AI service to the public, attempts to encode information about the design and function of their system in a summary. We find the communication-based definition of meaning provided in Bender and Koller (2020) useful for motivation. In their framework, the meaning m of an utterance is a pair (e, i) of the surface form e (such as text or speech audio) and the communicative intent i, which is external to language. In the case of disclosive transparency, this i is the particulars of the system being described. Any assessment of the disclosive transparency of a description e is fundamentally an assessment of its underlying i: the degree to which i contains the information necessary to reconstruct the system being described.
One other key element underpins our objective transparency metrics: some short descriptions come paired with longer ones. Anyone who has clicked "I Agree" without reading a terms of service, but has taken the time to look over a short, to-the-point statement on how data is used, knows that the lengthy legalese-laden descriptions that truly define the systems we interact with can be summarized succinctly. However, the succinct description exists to carry the detailed meaning of the actual contract a user is agreeing to. In other words, for the short description e' of an agreement, there exists a 'source' document e that is much more detailed, which shares a common i. An analogous pairing in system descriptions is what enables our metrics.

Figure 1: A diagram of our proposed method for extracting the style-appropriateness and replicability objective features to characterize disclosive transparency.

Objective Transparency Metrics
We now move to developing usable metrics that quantify both the replicability and style-appropriateness of real system descriptions. A good source of real-world system descriptions is the system demonstration tracks at academic AI conferences. Ideally, the purpose of a system demonstration paper at an academic AI venue is to produce a detailed enough description of a system that a colleague could (given sufficient resources and data) replicate it. For each of these lengthy documents e_i, there exists an abstract e'_i which briefly describes the system. Although these abstracts are not intended for consumption by the general public, they do succinctly describe implemented systems, some of which are intended to (at least hypothetically) be used by the public. The fact that these paper/abstract pairs are also freely available online makes them an appealing subject of study for transparency modeling. Details on the pilot study we complete using demonstration track NLP papers are provided in section 4. Figure 1 depicts, at a high level, how we assess our objective disclosive transparency metrics: the style-appropriateness 'clarity' metric C(e), and the replicability metrics of 'sentence affinity' R_A(e', e) and simulated 'information recovery ratio' R_R(e', e). We assess these features using language model scores derived from GPT-2 (Radford et al., 2019) and BERT (Devlin et al., 2018).
For fine-tuning data on academic language in AI and CS, we produce a training dataset by crawling a random sample of 100k LaTeX files from the arXiv preprint repository with topic label cs.* from 2007-2020. Additionally, we collect all 30k crawlable LaTeX files from the cs.CL label from 2017-2020, corresponding to recent research in natural language processing and computational linguistics. From these raw .tex files we produce task-specific plaintext training corpora as described in Appendix D. We then fine-tune the standard pretrained GPT-2 model from Huggingface (Wolf et al., 2020) on these corpora as explained below.
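For concreteness, a minimal sketch of such a fine-tuning run using the Huggingface Trainer API is shown below; the corpus file name, block size, and training hyperparameters are illustrative assumptions rather than the exact settings used.

```python
# Sketch: fine-tune pretrained GPT-2 on a plaintext arXiv corpus.
# The file name "arxiv_cs_corpus.txt" and hyperparameters are illustrative.
from transformers import (GPT2LMHeadModel, GPT2TokenizerFast, TextDataset,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Chunk the corpus into fixed-length blocks for causal LM fine-tuning.
train_dataset = TextDataset(tokenizer=tokenizer,
                            file_path="arxiv_cs_corpus.txt",
                            block_size=512)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(output_dir="gpt2-arxiv-cs",
                         num_train_epochs=1,
                         per_device_train_batch_size=2)
Trainer(model=model, args=args, data_collator=collator,
        train_dataset=train_dataset).train()
```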

Modeling Replicability
When discussing system descriptions, replicability typically refers to the ability of a third party to implement, from the description alone, a functionally analogous replacement for the original system. Since a tool that automates this replication procedure is impractical, we propose using the process of recovering the full text of the system description from the abstract as a proxy for replicability.
As we established previously, a reasonable assumption for the communicative goal i of an author of an academic paper e describing a system is to transmit the information necessary to reconstruct it. Obviously, it is not generally possible to perfectly recover i from e', or the portions of academic papers beyond the abstract would generally be superfluous.
The problem of generalized inversion of communicative intent i from form e is as yet unsolved (Bender and Koller, 2020; Yogatama et al., 2019). Thus, it is reasonable to treat recovery of e from a short summary e' as a proxy for recovering i, that is, for reconstructing the system. In other words, the replicability of an abstract e' can be modeled as the amount of information contained in the full article text e that can be recovered from e'. We propose two metrics to simulate this process using the document e and abstract e': trigram information recovery and sentence affinity.

Trigram recovery
We directly attempt the recovery of e from e' using a generative language model (in this case, GPT-2 (Radford et al., 2019)) fine-tuned on a full-text recovery task, where, for each abstract, the replicability score is the proportion of the trigram information in e recovered by the model-generated text.
Conditioned on an abstract, the model generates predicted sentences e_g that it expects to appear in the full paper. Given promising results on quantifying semantic complexity using simple measures such as n-gram entropy (McKenna et al., 2020), we use a trigram self-information metric to measure the amount of information content in e that is recovered in e_g.
Using the global trigram distribution of the training dataset, we take the ratio of the trigram self-information of all trigrams in e that are recovered in e_g to the total trigram self-information of e. This gives the ratio of recovered form information R_R, as defined in Equation 1:

R_R(e', e) = \frac{\sum_{t \in T(e) \cap T(e_g)} I(t)}{\sum_{t \in T(e)} I(t)}    (1)

where T(·) denotes the set of trigrams in a text and I(t) = -log p(t) is the self-information of trigram t under the training-set trigram distribution.
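A minimal sketch of this computation follows, assuming the full text e, the generated text e_g, and a background trigram count table from the training corpus are already available; whitespace tokenization and the smoothing constant are simplifying assumptions.

```python
# Sketch: trigram information-recovery ratio R_R (Equation 1).
# `background_counts` is a trigram frequency table over the training corpus;
# whitespace tokenization and eps-smoothing are illustrative assumptions.
import math

def trigrams(text):
    tokens = text.lower().split()
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}

def self_information(trigram, background_counts, total, eps=1e-12):
    p = background_counts.get(trigram, 0) / total
    return -math.log(p + eps)

def trigram_recovery_ratio(full_text, generated_text,
                           background_counts, total):
    source = trigrams(full_text)
    recovered = source & trigrams(generated_text)
    info = lambda grams: sum(self_information(g, background_counts, total)
                             for g in grams)
    denom = info(source)
    return info(recovered) / denom if denom > 0 else 0.0
```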

Sentence affinity
Using the aforementioned trigram recovery metric to model replicability has risks: chiefly, that the distribution of source papers may be too specific and not contain the requisite information, and that the generated output may be too noisy to meaningfully simulate the process of inverting a system from its abstract. Thus, we present an alternate method here. Zhang et al. (2019) presented a method for evaluating the similarity of sentence pairs called BERTScore, which is computed by averaging a greedy match of the cosine similarities of the BERT (Devlin et al., 2018) token embeddings b(t) for each token in a pair of sentences:

BERTScore(s_1, s_2) = \frac{1}{|s_1|} \sum_{t \in s_1} \max_{t' \in s_2} \cos(b(t), b(t'))

We propose extending this to match a sentence over the set of sentences in a document, producing a document-level affinity score

R_A(e_1, e_2) = \frac{1}{|e_1|} \sum_{s_1 \in e_1} \max_{s_2 \in e_2} \mathrm{BERTScore}(s_1, s_2)

where e_1 is the full document and e_2 is the short description. This metric has advantages over trigram recovery in that it does not rely on training a generative language model, and it is more readily interpretable, as Figure 2 demonstrates.
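A minimal sketch of this document-level affinity computation using the bert-score package follows; iterating sentence by sentence and using the package's default English model are illustrative choices rather than confirmed implementation details.

```python
# Sketch: document-level sentence affinity R_A. For each article sentence,
# take the best BERTScore F1 match against the abstract sentences, then
# average over article sentences (length-normalized area under the curve).
from bert_score import BERTScorer

scorer = BERTScorer(lang="en")  # default English BERTScore model

def sentence_affinity(article_sents, abstract_sents):
    if not article_sents or not abstract_sents:
        return 0.0
    per_sentence_max = []
    for sent in article_sents:
        # Score this article sentence against every abstract sentence.
        cands = [sent] * len(abstract_sents)
        _, _, f1 = scorer.score(cands, abstract_sents)
        per_sentence_max.append(f1.max().item())
    return sum(per_sentence_max) / len(per_sentence_max)
```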

Modeling Style-appropriateness
Figure 2: A demonstration of how our sentence affinity curve simulates the replicability of the abstract based on the degree to which it "covers" the content of the full article. For each line in the source document, the maximum pairwise BERTScore between it and the sentences in the short system description is measured. The affinity-replicability score of the article is the paper length-normalized area under this curve. The example abstract describes Tabouid, a word-guessing game automatically generated from Wikipedia.

Orthogonal to the question of whether an abstract contains the necessary information to perform a replication is the question of who would be capable of performing the replication. This question is important for assessing the appropriateness of a description for layperson audiences. We simulate this with a pair of language models tuned to two styles of writing, one general interest and one scientific, to make a perceptual model of "academic style," as a layperson would likely require many more textbook lookups to understand an academic paper or detailed terms of service than they would reading a Wikipedia article. This approach is analogous to the popular use of perceptual likelihood ratios in speech processing, which have been shown to correlate well with subjective perceptual opinions (Saxon et al., 2020).
For each passage we extract a likelihood ratio between two language models: one a GPT-2 model fine-tuned on the 100k arXiv file corpus, the other the vanilla GPT-2 model pretrained on a large, diverse corpus of online English text. Given the system demo abstract e, we compute the style-appropriateness C as a log-likelihood ratio of it belonging to this academia-specific distribution A versus a general public distribution V:

C(e) = \log P_A(e) - \log P_V(e)

where P_A and P_V are the likelihoods assigned to e by the arXiv-tuned and vanilla models, respectively.
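A minimal sketch of computing this ratio with Huggingface GPT-2 models is given below; the fine-tuned model path "gpt2-arxiv-cs" is a hypothetical name, and the lack of length normalization is an illustrative assumption rather than a confirmed implementation detail.

```python
# Sketch: style-appropriateness as a log-likelihood ratio between an
# arXiv-tuned GPT-2 ("academic" distribution A) and vanilla GPT-2 (general
# distribution V). The path "gpt2-arxiv-cs" is hypothetical.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
academic_lm = GPT2LMHeadModel.from_pretrained("gpt2-arxiv-cs").eval()
general_lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def log_likelihood(model, text):
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # .loss is the mean per-token negative log-likelihood
        mean_nll = model(ids, labels=ids).loss
    return -mean_nll.item() * ids.size(1)

def style_appropriateness(abstract_text):
    # C(e) = log P_A(e) - log P_V(e)
    return (log_likelihood(academic_lm, abstract_text)
            - log_likelihood(general_lm, abstract_text))
```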

Pilot Studies
To analyze how well our objective measures track with subjective notions of transparency, and to demonstrate how they might be used to investigate how transparency affects prospective user opinions in the real world, we perform two pilot studies utilizing our metrics and a corpus of real-world AI system descriptions.

*ACL System Demo Corpus
We extract system demonstration abstracts from EMNLP 2017-2020, ACL 2018-2020, and NAACL 2018 and 2019, retrieving a corpus of 268 abstracts describing a variety of demonstrations, including systems intended for use by the general public (e.g., translation systems, newsreaders) as well as demonstrations that are of interest more narrowly to the NLP community, software developers, or academics at large (e.g., toolkits, packages, or benchmarks). As we are interested in system descriptions for non-experts, we restrict our analysis to abstracts which describe systems intended for use by laypeople. This set contains 55 abstracts, describing diverse systems from automated language learning games, to news aggregators, to specialized search engines for medical topics.
We first collect expert opinions on how the abstracts conform to the aforementioned dimensions of transparency to analyze the quality of our automated metrics. Then, we collect salient layperson opinions of trust, understanding, and fairness to demonstrate a pilot study of how transparency drives user attitudes.

Connecting Objective with Subjective
Our first pilot study seeks to determine the extent to which our objective measures faithfully model subjective notions of transparency. To do this we must first pose a set of precise questions for subjectively assessing disclosive transparency. We identify three largely disjoint dimensions of disclosive transparency that each follow directly from a key question an implementer might ask when initially trying to understand a description:
Task Transp.: What task does this system solve?
Function Transp.: What components does this system contain, and how do they work?
Data Transp.: What are the inputs and outputs of this system? What kind of data collection and storage is required to train and operate it?

Table 1: Inter-rater reliability assessed using variance, average pairwise Pearson's correlation coefficient (PCC), and p-value for subjective transparency scores.
Each of these questions concerns some kind of information contained in i. They can be posed as transparency questions by prepending "to what extent does the description explain..." to each of them.
We consider task transparency interesting because members of the general public are often learning of new application areas in which emerging technologies such as NLP can be applied, and without clearly communicating the parameters of the task a system is solving, users cannot understand it. Function transparency is perhaps the most natural dimension, as discussing the "how" of a system is fundamental to explaining how it works. Finally, we consider data transparency because many public discussions of AI ethics center on the use and misuse of data, and it is a central focus of regulatory discussions of AI.

Collecting the Subjective Ratings
Four NLP Ph.D. students provided five-point Likert opinion scores of the task, function, and data transparency levels for each abstract in the *ACL Corpus. To ensure consistency across the raters we analyzed average abstract-wise variance, as well as the average pairwise inter-rater Pearson correlation coefficient (PCC) and p-value for each of the three transparency categories. Table 1 demonstrates the high inter-rater reliability of the scores, with each average variance < 0.55 on the 5-point scale.
For each abstract in the dataset we assign a subjective task, function, and data transparency score equal to the mean of the four opinion ratings, constituting the subjective transparency scores.
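A minimal sketch of this agreement analysis follows, assuming the ratings for one transparency dimension are arranged as a (raters x abstracts) array.

```python
# Sketch: inter-rater reliability (mean per-abstract variance, average
# pairwise PCC and p-value) and the mean subjective score per abstract.
# `ratings` is a (num_raters, num_abstracts) array for one dimension.
from itertools import combinations
import numpy as np
from scipy.stats import pearsonr

def rater_agreement(ratings):
    mean_variance = np.var(ratings, axis=0).mean()
    pairs = [pearsonr(ratings[i], ratings[j])
             for i, j in combinations(range(ratings.shape[0]), 2)]
    mean_pcc = float(np.mean([r for r, _ in pairs]))
    mean_p = float(np.mean([p for _, p in pairs]))
    return mean_variance, mean_pcc, mean_p

def subjective_scores(ratings):
    # Subjective transparency score per abstract: mean of the four ratings.
    return ratings.mean(axis=0)
```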

Pilot Study of User Attitudes
In this section we briefly outline how we assess user opinions regarding the system descriptions.
Starting from the three aforementioned dimensions of transparency (task, function, and data), we characterize three sets of user concerns orthogonal to these dimensions. These sets are "understanding," "fairness," and "trust." Together these form "user response variables" such as task understanding, function fairness, or data trust, which can be posed as questions about what the user thinks, believes, or understands.
To measure how user attitudes and confusion correlate with description transparency, we complete a user survey study using Amazon Mechanical Turk (AMT). Each abstract was shown to 10 crowd workers selected from a set of majority English-speaking locales (US, CA, AU, NZ, IE, UK), who were instructed to read the abstract and answer two sets of multiple-choice questions. The first set, opinion prompts, consists of five-point Likert scale subjective attitude questions; the second set, of retention questions, reveals how well the users can recall phrases from the abstract they just read. See Appendix B for details on opinion prompts, retention questions, and survey methodology.

Results
To study the correlations both between pairs of subjective variables and between our objective metrics and the subjective responses, we use Pearson's correlation coefficient (PCC). We use PCC rather than rank-correlation coefficients such as Spearman's because we are interested in evaluating whether our metrics can implicitly capture relative deltas in opinions, in order to serve as a useful proxy for the opinions themselves in future work.
In subsection 5.1 we study how the objective transparency metrics connect to expert opinion ratings along the three dimensions of transparency identified in subsection 4.2. These are the results from the first pilot study, testing if our objective metrics capture the subjective notions of disclosive transparency as understood by experts.
In subsection 5.2 we present the results from the second pilot study, and analyze relationships that appear between our objective transparency metrics, the subjective transparency ratings, and user opinion responses. In particular, we seek to determine if our objective measures can be used as a suitable proxy for the subjective expert opinions in modeling user responses to transparency.
In this section we present scatter plots containing several overlapping (x, y) coordinates. Where multiple identical samples are present, we vary both marker color and size to convey the relative quantities of repeats.

Objective vs Subjective Transparency
Table 2 shows the Pearson's correlations and p-values between the objective style-appropriateness and replicability transparency metrics and the subjective expert transparency ratings, and between the objective metrics themselves. The style-appropriateness (S-A) metric clearly captures information about task transparency, as it exhibits a clear, statistically significant positive correlation. However, the S-A metric does not exhibit significant trends with data or function transparency opinions.
On the other hand, the replicability metrics do exhibit statistically significant relationships: trigram recovery shows clear positive correlations with both function and data transparency, while sentence affinity rate correlates with function transparency alone; neither exhibits a statistically significant relationship with task transparency. Finally, the replicability and S-A metrics do not exhibit a significant correlation with each other.
Taken together, these results suggest that the replicability and S-A metrics capture subjective notions of transparency. Furthermore, they capture complementary elements of transparency, exhibiting no statistically significant correlation with each other while significantly explaining variance in different dimensions of subjective transparency. Figure 3 shows how our replicability trigram information recovery score is positively correlated with both expert function and data transparency ratings; this relationship is weaker with task transparency. Meanwhile, the S-A log-likelihood ratio score captures user opinions of both function and task understanding, but is not predictive of retention score (Figure 5).
While the relationship between replicability and function transparency was expected, the connection between S-A and task transparency was surprising at first. We think this is driven by a tendency for more detailed, transparent descriptions of a given task to more heavily utilize common language and analogies; after all, tasks such as translation, newsreading, and language learning all exist in the real world as topics of everyday discussion. Figure 4 shows the sentence affinity rate curves for two abstracts in the *ACL Demo Corpus. This demonstrates the interpretability of the affinity rate replicability score: the top curve has several peaks of particularly high BERTScore, which match to specific, technical sentences in the abstract providing concrete detail on how the system works.
Meanwhile, the bottom curve has few high-scoring peaks, the highest of which matches one of the vaguer sentences in the source abstract. In other words, descriptions with more specific technical detail in the abstract score higher on our objective transparency measures, as desired.

Transparency and Response Variables
To relate the abstract-level objective and subjective transparency ratings to the response variables, we assess pairwise PCC between pairs of (transparency label, response variable) across the 55 abstracts. Table 3 contains a sample of pairwise correlation analyses of expert subjective transparency opinions and the user response variables.
Of the statistically significant correlations between subjective scores we observe, the positive relationship between expert-rated average task transparency and average user self-reported task understanding is the strongest. Similarly, there is a strong, statistically significant positive correlation between task transparency and self-reported function understanding. However, we were surprised to find that our expert-rated average function transparency does not exhibit a statistically significant correlation with either of the aforementioned understanding variables. Figure 6 depicts the positive correlations of these two response variables with task transparency.
In other words, the more well-motivated and transparent the discussion of the task is, the better users believe they understand not only the task but also the function of the system being described. However, transparency in describing the function of the system does not lead users to think they understand either aspect better. This result was surprising.
As function transparency captures "how the system works," while task transparency is concerned with "what the system does," it is probably the case that understanding the task is a necessary prerequisite for understanding the system function. This could explain the connection between task transparency and user function understanding. However, this connection is illusory: for example, a detailed description of the problem of translating French to Arabic contains no information about attention mechanisms.

Related Work
Most previous work on explainability and intelligibility in transparency focuses on "explanations that are contrastive and sequential rather than purely subjective" (Miller, 2019). Furthermore, prior studies of transparency have tended to focus on post-facto user judgments of automated systems following their use. For example, Kizilcec (2016) and Wang et al. (2020) both studied how system users viewed the fairness of a system based on levels of transparency and the final decision the system made for them. In both studies, the authors found that transparency affects perceived fairness, but ultimately the strongest predictor of perceived fairness was whether the final system outcome was favorable to the user or not. Notions of transparency intersect broadly with work on fairness, accountability, and ethics of machine learning, and many solutions intended to remedy ethical concerns of machine learning include a disclosive component. Lim et al. (2009) studied how novice end-users come to understand the function of intelligible context-aware systems. Similarly to us, they derive explanations from fundamental questions of "what," "why," and "how" that resolve the gulf of evaluation and gulf of execution in the user, as put forward by Norman (1988).
While we are primarily concerned with disclosure to laypeople, disclosure between practitioners has been a key thrust of work in recent years. Disclosing potential deficiencies, biases, failure points, and intended use cases for datasets (Holland et al., 2018) and full systems (Arnold et al., 2019) is crucial in ensuring the ethical construction and deployment of systems that utilize machine learning while minimizing the risk of potential harms. All of these proposals focus heavily on outlining best practices to ensure that various sources of bias are accounted for and made available to system builders and decision makers before harms are perpetuated.
As briefly outlined in section 1, most work on transparency focuses on providing intelligible reasoning for decisions (Vaughan and Wallach, 2020). Lipton (2018) introduced simulatability, or the capacity for a user to comprehend all the constituent operations of a model by manually performing them, as fundamental to transparency; this relies on a notion of decomposability (Lou et al., 2012). Intelligibility is both an end in itself and a useful tool for ensuring buy-in from stakeholders such as users (Muir, 1994) and management (Veale et al., 2018). Knowles (2017) deals with designing systems that use intelligibility to convey evidence of trustworthiness under uncertain scenarios, and frames intelligibility as the degree to which the system's underlying logic can be conveyed to users.

Conclusions
Our replication-based framework allows us to tackle the problem of characterizing disclosive transparency in NLP system descriptions. We developed style-appropriateness and replicability metrics using neural language models and demonstrated that they capture multiple subjective dimensions of transparency. With these metrics we performed a pilot study characterizing layperson responses to varying levels of transparency in system descriptions, demonstrating the value of the conceptual framework and the concrete tools proposed.
There are three natural directions for future work to build on these results. First, the automated transparency metrics can be used to scale up future opinion studies, enabling the study of user opinions across larger corpora of system description text, free from the need for expert annotations. Second, more sophisticated conditional abstract-to-full-text inversion models can be swapped directly into the replicability metric to produce better automated transparency scores. Finally, the metrics can be used to guide other NLG processes, such as enabling transparency-constrained abstractive summarization or style transfer.

Table 4 demonstrates that increased replicability scores are associated with decreasing opinions of the ethics of the underlying task (increasing disagreement with the statement "I believe this system is made to solve an ethical task") but increasing opinions of function fairness (increasing disagreement with "I am concerned about the fairness of this system"). This was a surprising result that we are interested in attempting to replicate on larger corpora. With this relatively small sample size, it is hard to interpret why these conflicting trends in ethical opinions might arise.
This negative relationship between replicability and task ethics could be an artifact of the data used to train the abstract-to-full-text inversion model underpinning the replicability metric. It might be the case that there is higher public awareness of and controversy toward ML tasks that are better represented in the arXiv inversion dataset. However, this result might also represent a genuine ethical quandary for further research in transparency.
Much of Section 1 focused on making the case for why measuring transparency is important. However, if it really is the case that increased measurable disclosive transparency drives decreasing trust in system ethics, it is conceivable that system providers or governments could become less inclined to disclose. We would consider this a negative outcome. However, our results so far do not overwhelmingly suggest this relationship exists.
Another potential limitation of this study is that layperson end-users are not the target audience of these NLP conference abstracts. The decision to use academic writings as stimuli was driven primarily by the availability of (short description, lengthy treatment) pairs consisting of the abstracts and their corresponding full texts. However, we believe the short description/long document pairing is applicable to a number of fields, including terms of service as discussed in Section 2, and to other types of systems described in non-academic, but still technical text, such as patent disclosures and instruction manuals.
Finally, as attitudes toward privacy, fairness, and ethics can be culturally variable, we note that our survey results were sourced from crowd workers in majority-English speaking countries and may not be broadly representative.

A The Replication Room
Many insights that drove the aforementioned metrics sprang from a thought experiment inspired by Searle (1980) called a "replication room." Imagine a room in which a person sits with access to a hypothetically comprehensive set of domain-specific and general resources on machine learning (ML), natural language processing (NLP), software engineering, and programming languages, which collectively enable them to look up any applicable term of art or find instructions on implementing any standard component or system. In this replication room, the implementer receives system descriptions e from which they are tasked with producing a functionally analogous reproduction of the system i being described.
Under this scenario, clearly no system description can be less transparent than a blank piece of paper or completely unrelated text. In either of these edge scenarios, the description provides no clues about the system the implementer is meant to reproduce, and their chance of success is nothing but the prior over all possible implementable systems. Thus, in a channel communication sense, no information is conveyed.
On the other hand, a system description in which every detail is accounted for would represent a maximum-transparency scenario. A complete printout of all elements of the source code and all setup/training steps, a detailed tutorial, or even the full text of an academic system description paper could be sufficient for a successful reproduction. In these cases, no further information content would increase the level of transparency, as all the information needed is already included.
We dub the degree to which a system description contains the necessary information to perform a replication its replicability. Replicability as a notion of transparency is incomplete, however, as it makes no assumptions about the degree to which the supporting materials might be used, or the kind of person who is performing the replication.
Another component is the style-appropriateness of the description. In short, the style of language in the description, the overall quality of writing and ease of reading, the level of reliance on unexplained technical jargon, and various other factors all impact the degree to which a layperson would require use of the assistive materials, or the level of expertise that would be required for a replication to be successfully carried out without using the assistive materials. More precisely, the style-appropriateness of the description is tied to the degree to which domain knowledge about the AI problem area is necessary to reconstruct i from the conventional meaning s in Bender and Koller (2020)'s framework.

B Details from Crowd Worker Survey
Each crowd worker was shown a survey containing a system description abstract with instructions to provide their level of agreement on a five-point Likert opinion scale (strongly disagree, disagree, neutral, agree, strongly agree) with a set of six prompts intended to assess their level of "task understanding," "task fairness," "function understanding," "function fairness," "function trust," and "data trust." The prompts provided for each opinion value are given in Table 5. After the users completed the six opinion questions they were instructed to press a button to hide the abstract, after which the retention questions were revealed. Each subject was compensated with $0.10 USD per survey, and subjects averaged a Human Intelligence Task (HIT) completion time of 35 seconds. This translates to an average compensation of $10.29 per hour.

B.1 Opinion Prompts
Table 5: Prompts provided for each opinion value.
Task Understanding: I understand what this system is meant to do.
Task Fairness: I believe this system is made to solve an ethical task.
Function Understanding: I understand how this system works.
Function Fairness: I am concerned about the fairness of this system.
Function Trust: I think this system can accurately perform its task.
Data Trust: I think this system will protect my privacy and data.

All of the above opinion prompts, except for function fairness, produce responses where "strongly agree" corresponds to a positive position, such as confidence in one's own understanding or in the performance of the system. Thus, to regularize the scores, we reverse the polarity of the function fairness responses by reversing the order of numerical assignment, assigning 5 to "strongly disagree" with being concerned about fairness, 4 to "disagree," and so on.
While we are only interested in subjective user opinions about trust and fairness, for understanding we seek to analyze the "truth," as users might be overconfident in their pure opinions. To do this we must develop some feasible objective measure; for this study we use retention as a proxy for understanding.

B.2 Retention Evaluation
Out of a desire for some objective assessment of user confusion, we additionally ask participants to recall whether a set of phrases did or did not appear in the abstract they just read. The simple metric of retention accuracy reveals both how carefully a participant read the abstract and how much of it they retained in memory. While these questions fail to directly capture user understanding in the way that a conceptual quiz would, they can be generated automatically and consistently across topic areas.
After completing the opinion question section of the survey, crowd workers are instructed to press a "Hide Passage" button. Once the button is pressed, the system description abstract is greyed out, and a set of five retention questions is revealed. Each abstract has its own set of retention questions, which are randomly generated in advance by sampling sentences either from the abstract or from the other abstracts in the dataset.

Figure 7: Top topical keywords in the dataset (search, news, science, language learning, summary writing, writing).
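A minimal sketch of how the retention questions described above can be generated and scored follows; the split between true and distractor sentences shown here is an illustrative assumption.

```python
# Sketch: generate retention questions by sampling sentences from the
# abstract (label True) and from other abstracts (label False), then score
# a worker's yes/no answers. The 3/2 true/distractor split is illustrative.
import random

def make_retention_questions(abstract_sents, other_abstract_sents, n=5):
    n_true = 3
    questions = ([(s, True) for s in random.sample(abstract_sents, n_true)]
                 + [(s, False) for s in random.sample(other_abstract_sents,
                                                      n - n_true)])
    random.shuffle(questions)
    return questions

def retention_accuracy(questions, worker_answers):
    correct = sum(answer == label
                  for (_, label), answer in zip(questions, worker_answers))
    return correct / len(questions)
```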

C.1 User Opinion Agreement
Although the opinion scores are subjective, and we don't necessarily expect agreement, some measure of agreement in opinions between users on abstracts is desirable to further support the validity of our abstract-wise correlation analysis. We are unable to use PCC as a measure of inter-rater reliability here because no two abstracts were rated by the same 10 crowd workers. Thus, we are restricted to analyzing the average within-abstract variance of each response variable. Table 6 contains the average abstract-wise variance for each of the opinion response variables.

C.2 Task Domain as a Confounder
For each abstract we collect topical keywords to handle the potential confound of differing user opinions by problem domain, enabling analysis of whether the broad problem area a system is intended to solve (e.g., newsreaders, language learning apps, translation tools) is a confounder driving positive or negative user attitudes. Figure 7 depicts the top keywords in our dataset. Because we are evaluating user attitudes toward a diverse set of systems in distinct task domains, differences in popular perceptions toward the various tasks might bias user opinions. To ascertain if this is the case, we perform pairwise Student's t-tests across all pairs of keywords present in 5 or more different abstracts, on each of the subjective response variables. Table 7 contains the pairs of keywords that have statistically significant differences between their distributions, and the response variables along which these differences occur.
Of these, the most significant difference was between news and language learning abstracts on function fairness. Figure 8 depicts the distributions of the two sets on function fairness. It shows that the users tend to believe systems developed to aid in language learning are less prone to unfair bias than news reading, summarization, and synthesis applications. This difference could be driven by genuine differences in attitudes toward the components, subtasks, and processes used to build these systems, but there's a chance it is driven by societal attitudes toward learning and news instead.
Unfortunately, these pairwise keyword attitude difference tests are severely limited by small sample sizes. It is likely that more interesting analysis could be performed on a larger dataset.
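For reference, a minimal sketch of the pairwise keyword significance testing described above follows, assuming a mapping from each keyword to the per-abstract mean scores of one response variable; this data structure and the significance threshold are illustrative assumptions.

```python
# Sketch: pairwise Student's t-tests between keyword groups on a response
# variable. `responses` maps keyword -> list of per-abstract mean scores;
# the structure and the alpha threshold are illustrative assumptions.
from itertools import combinations
from scipy.stats import ttest_ind

def keyword_pair_tests(responses, min_abstracts=5, alpha=0.05):
    keywords = [k for k, v in responses.items() if len(v) >= min_abstracts]
    significant = []
    for a, b in combinations(keywords, 2):
        stat, p = ttest_ind(responses[a], responses[b])
        if p < alpha:
            significant.append((a, b, stat, p))
    return significant
```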

D arXiv Dataset Preparation
After sampling .tex files from the aforementioned ranges, they were converted into training data according to the following procedure; a code sketch of steps 2-4 appears after the list:
1. If the manuscript is stored across multiple .tex files, collate them into a single file by replacing all \input and \insert commands with their corresponding file text.
2. Split the resulting file using the \begin{} and \end{} tags for the abstract and document.
3. Using the pylatexenc python package, use the latex_to_text() command to convert the abstract and document LaTeX code into unicode.
4. Remove all redundant whitespace, tabs, and newline characters, and convert all resulting text to lower case.
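The sketch below illustrates steps 2-4 using the pylatexenc package named in step 3; the regular expressions used to locate the abstract and document environments are simplifying assumptions.

```python
# Sketch of steps 2-4: split out the abstract and document environments,
# convert LaTeX to plain text with pylatexenc, and normalize whitespace/case.
# The regexes here are simplified assumptions; real .tex sources vary.
import re
from pylatexenc.latex2text import LatexNodes2Text

def extract_abstract_and_body(tex_source):
    converter = LatexNodes2Text()

    def grab(env):
        pattern = r"\\begin{%s}(.*?)\\end{%s}" % (env, env)
        match = re.search(pattern, tex_source, re.DOTALL)
        return converter.latex_to_text(match.group(1)) if match else ""

    def clean(text):
        # Step 4: collapse whitespace/tabs/newlines and lowercase.
        return re.sub(r"\s+", " ", text).strip().lower()

    return clean(grab("abstract")), clean(grab("document"))
```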

E Regarding use of PCC
A perennial comment in response to some outliers in our correlation plots is "why didn't you use Spearman's correlation coefficient instead?" In this case, we use PCC because we are interested in capturing relative deltas and not just rank-order correlations.

Opinion Questions (survey interface excerpt)
Please provide your agreement with the following questions on a five-point scale from "I strongly agree" to "I strongly disagree."
I understand what this system is meant to do.
(five-point response scale from "Strongly agree" to "Strongly disagree")
After you have completed the opinion questions, press "Hide Passage" to reveal the retention questions.
(Hide Passage button)