An Inclusive Notion of Text

Natural language processing (NLP) researchers develop models of grammar, meaning and communication based on written text. Due to task and data differences, what is considered text can vary substantially across studies. A conceptual framework for systematically capturing these differences is lacking. We argue that clarity on the notion of text is crucial for reproducible and generalizable NLP. Towards that goal, we propose common terminology to discuss the production and transformation of textual data, and introduce a two-tier taxonomy of linguistic and non-linguistic elements that are available in textual sources and can be used in NLP modeling. We apply this taxonomy to survey existing work that extends the notion of text beyond the conservative language-centered view. We outline key desiderata and challenges of the emerging inclusive approach to text in NLP, and suggest community-level reporting as a crucial next step to consolidate the discussion.


Introduction
Text is the core object of analysis in NLP. Annotated textual corpora exemplify NLP tasks and serve for training and evaluation of task-specific models, and massive unlabeled collections of texts enable general language model pre-training. To a large extent, natural language processing today is synonymous with text processing.
But what belongs to text? More broadly, what information should be captured in NLP corpora and be available to the models at training and inference time? Despite its central role, the notion of text in NLP is fuzzy: while earlier work mostly focused on grammatical phenomena and implicitly limited text to written language, the application-driven NLP of the past years increasingly takes an inclusive approach to text by introducing non-linguistic elements into the analysis. Extensions range from incorporating emojis to exploiting document structure and cross-document relationships, and apply to all major components of the modern NLP infrastructure, incl. unlabeled text collections (Lo et al., 2020), language models (Aghajanyan et al., 2021) and annotated corpora.
The assumption that text in NLP solely refers to written language no longer holds (Figure 1).

Figure 1: The same textual document (a) can be seen in a variety of ways (b-d) depending on the assumed notion of text: while a syntax researcher might focus on written language (b), a summarization system can rely on document structure (c), and multimodal applications might use non-textual elements like tables and figures (d). Systematically capturing the differences between the assumed notions of text (top) requires a taxonomy of extended text. Such a taxonomy is currently lacking.

This is problematic for several reasons. From the representation perspective, machine learning assumes similarity between the source and the target distribution; yet a lack of consensus on the notion of text might result in an undocumented change of the input representation and degraded performance. From the modeling perspective, the accepted notion of text has a major influence on task and model design, as it both determines the tasks NLP aims to tackle and dictates the information available for solving those tasks. The final argument for studying the notion of text in NLP is conceptual: the capabilities of strong pre-trained Transformer models (Rogers et al., 2020) and general-purpose NLP frameworks (Wolf et al., 2020; Gardner et al., 2018; Akbik et al., 2019) have led to explosive growth in NLP beyond traditional, core tasks. The exposure to new, rich source document types like scientific articles (Lo et al., 2020) and slides (Shirani et al., 2021) and the growing influence of multimodal processing motivate the use of additional signals beyond written language in NLP. This leads to a general question about the scope of NLP as a discipline: if written language is no longer the sole object of study, what is, and how can it be formally delineated?
The study of how text is operationalized in NLP is hampered by the lack of documentation. While concurrent proposals address other key properties of NLP models (Mitchell et al., 2019) and corpora (Bender and Friedman, 2018; Gebru et al., 2018), like domain, language, and speaker and annotator demographics, the field lacks common terminology and reporting schemata for documenting and formally discussing the assumed notion of text. To address this, we contribute the following:

• A common terminology of text use in NLP (Section 2);
• A taxonomy of text extensions beyond the language-focused approach to text (Section 4), based on commonly used sources of NLP data and the current state of the art;
• A discussion of the challenges brought by the inclusive approach to text (Section 5);
• A broadly applicable schema for reporting text use in NLP (Section 6).

The notion of text is central to NLP, and we expect our discussion to be broadly relevant, with particular merit for documentation policy, NLP applications, as well as basic NLP research.

Terminology
Textual data available to NLP is a result of multiple processes that determine the composition and properties of texts. To support our discussion, we outline the data journey a typical text undergoes, and introduce common terminology. Figure 2 illustrates our proposed model.
Text production. Every text has been produced by a human or by an algorithm with a certain communicative purpose. Raw text is rarely exchanged; to avoid ambiguity, we use the term document for a unit of information exchange.¹ Documents consist of text along with additional structural and multimodal elements, serialized according to a certain format and accompanied by metadata. In our broad definition, textual documents include blog posts, Wikipedia articles and Tweets, as well as single dialogue turns and search queries. Widely used textual document formats include plain text, Markdown, PDF, etc.

¹ There are many other kinds of documents, e.g. images, audio or code; here we focus on "textual" documents.

Figure 2: A text is produced in an environment (a) and becomes part of the document space (b) that is sampled (c), often based on source (d). The sample is transformed into NLP artifacts (e) that are potentially reused and further refined across multiple studies (f) to produce further artifacts, etc. This process determines the notion of text assumed by downstream NLP research and the capabilities of the resulting artifacts.

Document space. All textual documents ever produced constitute the abstract document space. The document space incorporates both persistent textual documents that are stored (e.g. Wikipedia articles) and transient textual documents that only exist temporarily (e.g. search queries). Despite the apparent abundance of textual documents on the Web, most of the document space is not openly available, or is protected from research use by copyright-related, privacy-related and technical constraints.
Sampling and sources. Since capturing the entire document space is not feasible, a sample from the subspace of interest is used. The document space can be segmented in a variety of ways, incl. language, domain or variety (Plank, 2016), creation time, etc. One common way to sample textual documents is by source: documents from the same source often share key characteristics like language variety, text production environment, format and licensing. Some widely used data sources in NLP are Wikipedia, PubMed, arXiv, etc.
NLP artifacts. Finally, sampled textual documents are used to create artifacts, including reference collections like BooksCorpus (Zhu et al., 2015) and C4 (Raffel et al., 2020), and general-purpose language models like BERT (Devlin et al., 2019) and GPT-3 (Brown et al., 2020), that are widely reused across downstream NLP studies. The notion of text assumed by NLP artifacts is shaped both by the data journey and by the preprocessing decisions made during artifact construction. These, in turn, determine how text is operationalized downstream. Due to the differences in how text is produced, sampled and captured, two NLP artifacts might assume very different notions of text. Yet, a framework to systematically capture this difference is lacking.

Prior efforts
Our proposal draws inspiration from several recent efforts in documenting other common properties of ML and NLP artifacts. Model cards (Mitchell et al., 2019) offer a compact way to capture core information about machine learning models, incl. technical characteristics, intended and out-of-scope use, data preprocessing details, etc. Datasheets (Gebru et al., 2018) focus on dataset composition, details of the data collection process, preprocessing, distribution and maintenance. In NLP, data statements (Bender and Friedman, 2018) focus on bias mitigation, detailing key aspects of NLP artifact production such as curation strategy, language variety, demographics of speakers and annotators, speech situation, topic and genre. Rogers et al. (2021) propose a formalised checklist documenting copyright-, bias-, privacy- and confidentiality-related risks. Formal proposals are mirrored by community efforts on data repositories like huggingface datasets (Lhoest et al., 2021); editorial guidelines encourage authors to report key parameters of NLP artifacts.
While existing approaches to NLP artifact documentation cover a lot of ground, the requirements for documenting the assumed notion of text remain under-specified. Our work is thereby complementary to these prior efforts: for example, our reporting schema (Section 6) can be seen as a specification of the Speech Situation and Text Characteristics sections of the data statements (Bender and Friedman, 2018). Going beyond prior approaches, our discussion above suggests that it is desirable to document the assumed notion of text at each step of the NLP data journey, incl. text production tools, document space samples, as well as NLP models and datasets, with a special focus on widely reused reference corpora and pre-trained language models.

Taxonomy of text extensions

Preliminaries
We ground our proposal in two sources. The text production stage is critical for NLP as it determines what information is potentially available to downstream processing; to approximate what information could be used by NLP artifacts, we (1) conduct an analysis of four representative document sources widely employed in NLP. On the other side of the data journey are the NLP artifacts, the end-product of NLP preprocessing, modeling and annotation. To approximate what information is being used by NLP, we outline the de-facto, conservative approach to text and (2) survey recent efforts that deviate from it towards a more inclusive notion of text.
Sources. Wikipedia (Wiki) is a collaborative encyclopedia; it has been widely used as a data source for task-specific and general-purpose NLP modeling. BBC News (BBC) represents newswire, one of the "canonical" domains characterized by heavily edited written discourse. StackOverflow (Stack) is a Q&A platform that represents user-generated technical discourse on social media. Finally, the ACL Anthology (ACL) is an open repository of research papers from the ACL community and represents scientific discourse, a widely studied NLP application domain. For our analysis we sampled five documents from each of the data sources (Appendix A): for Wiki, we selected featured articles from five distinct portals to ensure variety; from BBC we selected the top five articles of the day; for Stack we used five top-rated question-answer threads; for ACL, we picked five papers from the proceedings of ACL 2022 available online.
Baseline: Written language. The conservative, de-facto approach to text in NLP is "text as written language": parts of source documents that contribute to grammatical sentences constitute the primary modeling target, whereas non-grammatical elements are considered noise and potentially discarded. This tradition is persistent throughout the history of NLP, from classic NLP corpora (Pradhan and Xue, 2009; Marcus et al., 1993) and core NLP research, to modern large-scale unlabeled corpora used for model pre-training (Zhu et al., 2015; Raffel et al., 2020; Merity et al., 2016), language models (Devlin et al., 2019; Brown et al., 2020) and benchmarks (Wang et al., 2018). While the focus on text as written language is justified for grammatical and formal semantic analysis, for other use cases it proves limiting, and below we survey the emerging inclusive approaches to text that exploit non-linguistic signals to boost performance and to enable new applications of NLP. Table 1 summarizes our proposed two-tier taxonomy for describing the inclusive approaches to text. It demonstrates the wide variety of signals available and potentially relevant for NLP processing beyond the conservative, language-centric view. The following sections discuss the taxonomy classes in greater detail, and Figure 3 provides examples.
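For concreteness, the two-tier taxonomy can be rendered as a simple data structure. The following sketch is purely illustrative (the mapping and the helper function are our own encoding, not part of any formal apparatus proposed here); the tier and class names follow Table 1:

```python
# Illustrative encoding of the two-tier taxonomy of text extensions.
# Tier names ("Body", "Context") and class labels follow Table 1;
# the Python representation itself is a hypothetical convenience.
TAXONOMY = {
    "Body": {
        "A1": "Content",     # emojis, math, code, tables, anchors, ratings
        "A2": "Decoration",  # font, style, coloring, emphasis
        "A3": "Structure",   # paragraphs, sections, pages, columns
    },
    "Context": {
        "B1": "Linking",     # intra- and cross-document links
        "B2": "Adjacency",   # neighboring documents, revision histories
        "B3": "Grouping",    # portals, topics, tags, dates
    },
}

def classes(tier):
    """Return 'code: name' labels for all classes in the given tier."""
    return [f"{code}: {name}" for code, name in TAXONOMY[tier].items()]
```

Such an encoding makes the two-tier shape explicit: every class belongs to exactly one of the two high-level tiers, and lower-level subclasses could later be nested under the leaf labels as the taxonomy grows.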

Body
The first high-level class of our taxonomy encompasses the phenomena related to the main, contentbearing parts of the textual document.
A1: Content. The analysis of our sources reveals that naturally occurring textual documents systematically use multiple signal systems besides written language per se. Examples of non-linguistic information in textual documents include, but are not limited to, emojis, mathematical notation, code, hyperlink, citation and footnote anchors, tables and multimedia, as well as arbitrary numerical and categorical information like scores and ratings (e.g. on Stack). The stance towards such non-linguistic elements of text ultimately determines whether an NLP artifact can represent them in a satisfactory manner, and recent NLP works successfully use non-linguistic elements to their advantage. Applications in sentiment analysis make use of emoji (Felbo et al., 2017); recent research addresses text generation based on tables; other work (2021) integrates layout information into language model pre-training, resulting in improved performance across a wide range of tasks. The ability to handle non-linguistic signals is key for NLP applications and motivates careful documentation of text content.

A2: Decoration. Content is complemented by decoration across all of our sources. Text decoration can take the form of font change, style, coloring, etc., and carries important secondary information for readers, incl. emphasis, quotation, and the signaling of structure (A3 below). An important function of text decoration is to mark code-switching between different signal systems, from simple language change to the switch to mathematical notation and code, e.g. on Stack and ACL. Over the past years, decoration has received some attention in NLP: Shirani et al. (2019, 2020) explore the task of emphasis modeling in visual media, and Shirani et al. (2021) extend it to presentation slides. While humans widely use text decoration, the semantics of decoration are source- and author-dependent and require further systematic investigation.
A3: Structure. Most naturally occurring textual documents are not flat, linear text, as assumed by commonly used reference corpora from the Penn Treebank (Marcus et al., 1993) to BooksCorpus (Zhu et al., 2015). Instead, the relationships between individual units of content are encoded in document structure. The simplest form of structure is the paragraph; longer documents often exhibit a hierarchy with sections and subsections; visual proximity is widely used to include additional content blocks like quotations, definitions, footnotes, or multimedia. For print-based media, textual documents can be further organized into pages, columns, lines, etc. Explicit document structure is increasingly used in NLP: Cohan et al. (2019) use section information to help citation intent prediction; Ruan et al. (2022) exploit document structure to aid summarization; Sun et al. (2022) exploit structure to study the capabilities of long-document language models; further work proposes the Intertextual Graph as a general structure-aware data model for textual documents and uses it to support annotation studies and explore how humans use document structure when talking about texts. Document structure is implicitly used in HTML-based pre-training of language models (Aghajanyan et al., 2021), leading to superior performance on a range of tasks and enabling new pre-training strategies; a separate line of study is dedicated to the analysis of visual document layouts (Shen et al., 2021). The lack of a common approach to formalizing document structure calls for systematic reporting of what structural elements are available in sources, and of how document structure is represented and used in NLP.

Context
The second high-level class of our taxonomy pertains to context. Every text is written and read in the context of other texts, and the ability to capture and use context is a key property of NLP artifacts.
B1: Linking. The first major contextualization mechanism is explicit linking: a marked relationship between an anchor text and a target text. Linking is crucial to many text genres and is found throughout the document sources considered in our analysis. An intra-document link connects two elements within one textual document (e.g. a reference to a chapter or footnote), while a cross-document link connects elements in different documents (e.g. hyperlinks and citations). Links differ in the granularity of their anchors and targets: the same Wiki page can cite its sources on the level of individual sentences (sentence to document) and as a list for further reading (document to document); a research article from ACL can reference a particular statement in a cited work (sentence to sentence).

B2: Adjacency. A textual document can also be contextualized by adjacent documents, e.g. neighboring turns in a dialogue, other posts in a thread, or preceding revisions of the same document. Several recent works in NLP tap into this narrow context (Zhang et al., 2019; Iv et al., 2022; Schick et al., 2022; Spangher et al., 2022). Like linking, adjacency provides a rich, naturally occurring type of contextualization.
B3: Grouping. Finally, a textual document can be contextualized by the region of the document space that it belongs to: for example, a Wiki page exists in the context of other pages belonging to the same portal; a BBC article is positioned among the other articles of the same day or topic. Group context both provides the expected common background for text interpretation and sets the standards for the composition of individual textual documents. Group context plays a key role in designing discourse segmentation schemata (Teufel et al., 2009; Hua et al., 2019; Kennard et al., 2022), can provide natural labels for text classification, and has been used to augment language models (Caciularu et al., 2021).

Remarks
Completeness. Our taxonomy is a first attempt at capturing the notion of text used in NLP in a structured manner. While we believe that the high-level taxonomy provided here is comprehensive, due to our focus on textual documents we do not incorporate further divisions related to multimedia content (e.g. we do not distinguish between images and graphics, although such a distinction could be of interest for some applications). As more sources and NLP artifacts are documented, new lower-level taxonomy classes are likely to emerge.
Interactions. The proposed taxonomy dimensions are not orthogonal and do interact: for example, group context (B3) can influence document structure (A3) and decoration standards (A2); in turn, decoration is widely used to signal document structure and linking (B1); the presence of adjacent context (B2) can affect the level of detail in the content (A1). The existence of such inter-dependencies motivates joint documentation and analysis of the different aspects of text, even if a conservative notion of text is adopted in the end.

Interoperability and generalization
A great advantage of the conservative, written-language-only view of text is wide interoperability and generalization: any textual document, from scientific articles to Tweets, can be reduced to written language. This makes it possible to apply a BERT model trained on books to a question-answering prompt and expect non-trivial performance, and enables the reuse of text processing frameworks and annotation tools. Yet, such reduction leads to substantial information loss and bears the danger of confounding due to the interactions between different aspects of text and the text body. While isolated efforts towards an inclusive notion of text exist, we are not aware of general approaches that would allow capturing different aspects of text in a systematic manner across domains and document formats. While arriving at a universal, general inclusive notion of text for NLP might not be feasible, we believe that reflecting on the generalization potential of non-linguistic textual elements is the first step in this direction.

Impact of production environment
The text production environment plays a key role in what information can be captured by a textual document, which, in turn, determines the capabilities of the downstream NLP artifacts. While a sophisticated text editing interface promotes the use of decoration, non-textual content, structure and linking, a plain text-based input field does not. Moreover, regulating documents and norms that accompany text production have a profound impact on text composition: for example, in addition to the common expectations of a scientific publication, ACL provides document templates, sets page limitations and often enforces obligatory structural elements, e.g. reproducibility and limitation sections; Wiki is supplied with extensive general, portal- and page-specific guidelines, as well as strict formatting requirements enforced by the community; similar mechanisms are characteristic of most other sources of textual data. Finally, the text production environment might determine the availability of adjacent and group context during text production. Despite its crucial role, we are not aware of NLP studies that investigate the impact of the text production environment on the resulting texts, and believe that our taxonomy might serve as a viable scaffolding for such studies.

Implications
Efficiency. The computational demands of NLP research are a growing concern (Strubell et al., 2019). It remains unclear how the transition to an inclusive treatment of textual documents might affect the efficiency of NLP models. Modeling additional aspects of text might require more parameters and increase computational demands; yet, the synergies between different aspects of text might allow NLP models to converge faster during training. We are not aware of NLP studies that systematically investigate the effects of an inclusive approach to text on the training of NLP models, and believe that this question requires further scrutiny.
Ethics. Recent years have been marked by increased attention to the ethics of NLP research, broadly including the issues of privacy, confidentiality, licensing and bias (Rogers et al., 2021; Bender and Friedman, 2018; Dycke et al., 2022). While some types of information beyond written language do not constitute a threat, as they are openly accessible in the source textual documents (e.g. textual content A1, decoration A2 and structure A3), others are potentially harmful: precise details of text production might impact privacy, and the inclusion of certain contexts (e.g. edit histories, B2) might expose NLP artifacts to false and incomplete information. We are not aware of systematic NLP research into what types of non-linguistic information about textual documents are safe to store and report.
Methodology. Current NLP methodology is tailored to a conservative approach to text, from commonly reported statistics (e.g. number of words) to modeling objectives and evaluation metrics. The transition towards an inclusive notion of text calls for a careful revision of NLP practice. Dataset statistics might include information like the number of figures and tables (A1) or structural information on the intra-document (A3) and inter-document (B1-3) level. Pre-trained language models would need to process new types of content, structure and context. Evaluation metrics would need to take the new signals into account; in addition, machine learning models readily exploit artifacts in data during learning (Gururangan et al., 2018), and besides providing a useful training signal, an inclusive notion of text introduces new potential sources of heuristic behavior and bias. Future research must determine the optimal ways to operationalize the inclusive approaches to text in NLP.

Reporting
An inclusive approach to text is an emerging trend in NLP that demands systematic study. While preparing this work, it became evident that the lack of structured reporting limits the meta-analysis of text use in NLP. In line with related documentation efforts, we propose a simple mechanism for reporting text use. If adopted at large, such reporting will make it easier to gauge the capabilities of data sources and NLP artifacts, increase community awareness of what aspects of text are represented and used, allow the aggregation of text use information from different studies, and help set standards for an inclusive approach to text in NLP.

Schema
As our proposed taxonomy is subject to extension, and to keep the reporting effort low, we formulate the proposed reporting schema as a set of open-ended questions guided by the examples in Table 1. We encourage reporters to complement it with new categories and phenomena if necessary. For each NLP study that uses or creates textual documents or NLP artifacts, we propose to include the following information in the accompanying publication:

• Body: Does the source, format, dataset, model or tool incorporate or use any information apart from written language, incl. non-linguistic content, decoration and structure?
• Context: Does the source, format, dataset, model or tool incorporate or make use of additional context beyond a single document, incl. by linking, adjacency or via group context? If yes, what is it and how is it used?

In addition, for text document sources and interactive NLP models we propose to document the production environment: How are the documents produced, incl. the guidelines, software and hardware used? Are the documents single-authored or written collaboratively? How can these factors influence text body and context? Optionally, we invite researchers to reflect upon the implications of their approach to text for generality, efficiency, ethics and NLP methodology. Is the newly introduced signal widely used across textual documents? Does it introduce computational overhead or help reduce computational cost? Can the newly incorporated information lead to bias or privacy risks, or promote heuristic behavior? How does the selected methodology take the non-linguistic nature of the new information into account?
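The answers to these open-ended questions could also be recorded in a lightweight, machine-readable form to ease later aggregation. The following sketch is one hypothetical encoding (the field names and the `report_text_use` helper are our own illustrative choices, not a fixed format):

```python
# A hypothetical machine-readable rendering of the reporting schema.
# Required parts mirror the Body and Context questions; production
# details and implications are optional, as in the prose schema.
def report_text_use(body, context, production=None, implications=None):
    """Assemble a text-use report; omitted optional parts are left out."""
    report = {"body": body, "context": context}
    if production is not None:
        report["production"] = production
    if implications is not None:
        report["implications"] = implications
    return report

# Example: a condensed report for a Q&A-style source.
example = report_text_use(
    body={"content": ["text", "code", "math", "images"],
          "decoration": ["emphasis", "code-switching"],
          "structure": ["titles", "paragraphs"]},
    context={"linking": ["hyperlinks"],
             "adjacency": ["answers", "comments", "revisions"],
             "grouping": ["tags"]},
    production={"interface": "Markdown-based UI", "collaborative": True},
)
```

A structured record of this kind could later be aggregated across studies, e.g. to count how many documented sources make use of decoration or group context.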

Example
To illustrate the intended use of the proposed schema, we provide example documentation for a textual source (StackOverflow). We note that despite its brevity and potential incompleteness, this kind of documentation is highly informative, as it both allows one to quickly grasp the notion of text assumed by a data source or artifact, and makes it possible to aggregate this information across different kinds of NLP studies in the future.
StackOverflow hosts three main types of textual documents: questions, answers and comments. (A) Body: documents are richly formatted and include multiple content types (text, code, math, images) and decoration (emphasis, code-switching, links). Documents are associated with additional metadata, author and creation/edit time; questions and answers are assigned a rating (number of votes), and questions are tagged. Basic structure is present: questions and answers can be logically structured and questions are titled; comments, however, are usually short and unstructured. (B) Context: linking is used throughout, mostly via hyperlinks, both to documents on the platform and to external documents; questions, answers and comments are related by adjacency; revision histories are available for questions and answers; questions are grouped via tags, and answers and comments are grouped by question. Production: questions and answers are entered via a Markdown-based UI that supports formatting, structuring, lists, links, code and block inserts, and table formatting. The question submission form additionally includes a title and a tag field. While posting an answer, the user has direct access to the question, previous answers and comments. Guidelines for asking and answering questions are available and are enforced both by explicit moderation and by the community.

Implementation
Unlike prior proposals that focus on documenting datasets and models separately, our schema applies to all stages of the NLP data journey, from data sources to NLP artifacts, and to all kinds of NLP artifacts, including reference corpora, labeled corpora, preprocessing tools, pre-trained and end-task models, and applications. The schema can be incorporated into data statements and editorial guidelines. We encourage the community to make use of this mechanism as a step towards better interoperability of NLP artifacts and the systematic study of the inclusive notion of text. We specifically highlight the need for documenting commonly used sources of textual documents; this will provide the NLP community with a better picture of the document space. We deem it equally important to document widely reused pre-trained language models and reference corpora, since their capabilities have a major impact on downstream NLP modeling and applications. This would allow us to gauge how far NLP is from accurately modeling the document space, and will highlight the gaps future work would need to address on the way towards a generally applicable inclusive approach to text.

Conclusion
Text plays a central role in NLP as a discipline. But what belongs to text? The rise in applications of NLP beyond purely linguistic tasks motivates an inclusive approach to text beyond written language. Yet, progress so far has been limited to isolated research efforts. As NLP ventures into new application areas and tackles new tasks, we deem it crucial to document the notion of text assumed by data sources and NLP artifacts. To this end, we have proposed common terminology and a two-tier taxonomy of inclusive approaches to text, complemented by a widely applicable reporting schema. We hope that our contributions and discussion help the community systematically approach the change of NLP's scope towards more accurate modeling of text-based communication and interaction.