Are Layout-Infused Language Models Robust to Layout Distribution Shifts? A Case Study with Scientific Documents

Recent work has shown that infusing layout features into language models (LMs) improves processing of visually-rich documents such as scientific papers. Layout-infused LMs are often evaluated on documents with familiar layout features (e.g., papers from the same publisher), but in practice models encounter documents with unfamiliar distributions of layout features, such as new combinations of text sizes and styles, or new spatial configurations of textual elements. In this work we test whether layout-infused LMs are robust to layout distribution shifts. As a case study we use the task of scientific document structure recovery, segmenting a scientific paper into its structural categories (e.g., "title", "caption", "reference"). To emulate distribution shifts that occur in practice we re-partition the GROTOAP2 dataset. We find that under layout distribution shifts model performance degrades by up to 20 F1. Simple training strategies, such as increasing training diversity, can reduce this degradation by over 35% relative F1; however, models fail to reach in-distribution performance in any tested out-of-distribution conditions. This work highlights the need to consider layout distribution shifts during model evaluation, and presents a methodology for conducting such evaluations.


Introduction
Humans use layout to understand the organizational structure of visually-rich documents such as scientific papers, newspaper articles, and web pages. For instance, a reader might use font size and boldfacing to recognize a section title, while they might use spatial location to recognize a footnote. Based on the intuition that layout aids in document understanding, recent work introduced layout-infused language models (LMs). To improve document processing, these models incorporate layout features such as the styling, size, and spatial configuration of document text. However, these features often change between documents: are layout-infused LMs robust to shifts in layout distribution?

Figure 1: Model performance on document structure recovery, comparing training and testing in-distribution vs. out-of-distribution. Error bars indicate standard deviation across runs. Layout distribution shifts degrade model performance by up to 20 F1 (Section 5.2). Simple training strategies such as few-shot fine-tuning and increasing training diversity partially mitigate the drop shown here (Section 5.3, Section 5.4).
With rising interest in processing visually-rich documents, some language models have been augmented with components specifically designed to process layout features. Layout-infused models accurately process documents with similar layouts to those seen during training (Shen et al., 2022; Huang et al., 2022b), and can leverage visual information to better understand long-range dependencies (Nguyen et al., 2023). But in practice, models often encounter documents with different layouts: for instance, pages with a different number of columns, a different density of words on the page, and different locations of textual elements. In order to realistically evaluate model performance, we study model performance under layout distribution shifts. Although robustness to text-distribution shifts has been relatively well-studied, layout distribution shifts pose unique challenges and may require unique solutions; therefore, we focus specifically on robustness to layout distribution shifts.
Here we present a case study to evaluate model robustness to layout distribution shifts. Our case study focuses on the task of segmenting a scientific paper into its structural categories. For example, a scientific paper might be segmented into categories such as TITLE, AUTHORS, CAPTION, PARAGRAPH, and REFERENCE. We refer to this task as document structure recovery. We chose to focus on this task because it serves as a testbed to determine whether layout-infusion remains beneficial under layout distribution shifts. Document structure recovery requires grasping the organizational structure of a document, a key piece of information that layout conveys. Moreover, layout-infused models have been shown to reach state-of-the-art performance on this task, in settings where the layout distribution is the same between training and testing (Shen et al., 2022; Huang et al., 2022b).
To test model robustness, evaluations must train and test on examples drawn from different layout distributions. However, existing datasets for document structure recovery use random train-test splits, and thus are ill-suited for evaluating model robustness. In this work we leverage publisher metadata to construct train-test splits that reflect layout distribution shifts. Publisher metadata is a proxy for layout distribution shifts because publication venue is a key driver of layout differences -different publishers adhere to different style guides and templates (e.g., Figure 2). Moreover, layout differences across publishers reflect layout distribution shifts faced in practice as new publishers, templates, and style guides arise. We use publisher metadata to propose new train-test splits of an existing dataset (GROTOAP2, Tkaczyk et al. (2014)) for scientific document structure recovery. These splits are designed to reflect Layout Distribution Shifts; hence we refer to the splits as GROTOAP2-LDS.
Using GROTOAP2-LDS, we evaluate a set of layout-infused LMs and find that model performance degrades by up to 20 F1 under layout distribution shifts (Figure 1). We show that layout-infused models can quickly adapt to new distributions, and that increasing training diversity can improve model robustness. However, even with diverse training sets and few-shot fine-tuning, performance on out-of-distribution layouts remains more than 2 F1 below in-distribution performance. Although layout-infusion aids in processing documents with in-distribution layouts, layout-infused LMs may overfit on features seen during training. We release our code and evaluation suite to enable future evaluations and to facilitate expansions of our evaluation suite.

Background and Related Work
Recent work established that models are often sensitive to distribution shifts. Shifts in the distribution of text or image statistics have been shown to substantially degrade model performance (e.g., Geirhos et al., 2020;Bai et al., 2021;Ye et al., 2021;Miller et al., 2020;Koh et al., 2021), even in cases when human performance is robust to these distribution shifts (Miller et al., 2020). For example, question-answering models struggle to generalize from text in Wikipedia to text in newspaper articles (Miller et al., 2020), and image classification models struggle to generalize between images taken from different cameras (Koh et al., 2021). We extend this line of research to study robustness to shifts in the distribution of layout features.
Document structure recovery provides an opportune setting for evaluating robustness to layout distribution shifts. Solving this task requires understanding how the text and visual layout of a page convey the organizational structure of the document, and layout-infused LMs have been shown to reach near-human performance on this task (Tkaczyk et al., 2014; Shen et al., 2022). Prior work has shown that models transfer poorly across different document types (e.g., from scientific papers to financial documents) (Pfitzmann et al., 2022). Although different document types exhibit differences in layout, they also exhibit large differences in other features, such as the textual domain and the distribution of structural categories. It is therefore unclear whether poor transfer across document types is due to layout distribution shifts or other factors. In this work, we experiment using train-test splits exhibiting different layout distributions but with documents of the same type (i.e., scientific papers from biomedical journals). Existing evaluation datasets for document structure recovery include many document types, such as scientific papers (Tkaczyk et al., 2014; Zhong et al., 2019), forms (Jaume et al., 2019), receipts (Park et al., 2019), and long-form business documents (Graliński et al., 2020; Pfitzmann et al., 2022). We focus on scientific papers, where layout distribution shifts are prevalent (Figure 2). Although existing datasets for scientific document structure recovery contain documents with different layouts, existing train-test splits do not reflect layout distribution shifts.

Evaluation Methodology
To evaluate model robustness, we propose a set of new train-test splits of GROTOAP2. These splits reflect layout distribution shifts that occur in practice, and we refer to this set of splits as GROTOAP2-LDS. In this section we formally define our task (Section 3.1), describe our procedure for partitioning data into splits that emulate layout distribution shifts (Section 3.2), and present a specific benchmark for evaluating robustness to layout distribution shifts (GROTOAP2-LDS, Section 3.3).

Task Definition
For each page, a model receives N words w_1, ..., w_N in detected reading order. Layout-infused models receive additional page features, such as the x- and y-coordinates of the bounding box of each word or an image of the page. Given these inputs, the model must predict category labels y_1, ..., y_N, one for each word, where y_i is selected from a set of structural page categories (e.g., TITLE, CAPTION, AUTHORS).
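The input/output contract above can be sketched in a few lines of Python. The data structures below are illustrative assumptions for exposition, not the actual format used by GROTOAP2.

```python
from dataclasses import dataclass

@dataclass
class PageWord:
    text: str    # word content, in detected reading order
    bbox: tuple  # (x0, y0, x1, y1) bounding-box coordinates on the page
    label: str   # gold structural category, e.g. "TITLE"

# A toy page: the model sees the words (and, for layout-infused models,
# the bounding boxes) and must predict one category label per word.
page = [
    PageWord("Robustness", (120, 40, 220, 60), "TITLE"),
    PageWord("Jane",       (120, 70, 160, 85), "AUTHORS"),
    PageWord("Figure",     (80, 500, 130, 515), "CAPTION"),
]

words  = [w.text for w in page]   # text input for all models
bboxes = [w.bbox for w in page]   # extra input for layout-infused LMs
labels = [w.label for w in page]  # prediction targets, one per word

assert len(words) == len(bboxes) == len(labels)
```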

Dataset Construction Procedure
We focus on layout distribution shifts within scientific papers, but our data partitioning procedure is agnostic to the particular document type. In the future, this procedure could be used to evaluate model robustness with other types of documents, such as receipts from different vendors or articles from different newspapers.

Document-Level Layout Assignments
Previous evaluation setups assigned dataset splits at the page level, sometimes placing different pages of the same document in both the train and the test set (Tkaczyk et al., 2014). However, layout formatting decisions are often made at the document level, and the layouts of different pages in a multi-page document are often highly dependent on each other. We therefore consider layout in terms of whole documents, and assign dataset splits at the document level.

Provenance Metadata as a Proxy for Layout Distribution Shifts For scientific papers, different publishers format papers with different layouts (Tkaczyk et al., 2014), and layout differences across publishers reflect distribution shifts that may occur in practice. We therefore use publisher metadata to partition documents into different dataset splits. Existing datasets often do not preserve provenance metadata, instead including only the content and task labels for each document (e.g., Tkaczyk et al., 2014; Zhong et al., 2019). Fortunately, scientific literature citation tools provide a way to recover publisher metadata for scientific papers. To link each paper to its associated publisher, we query the Semantic Scholar database with the title of each publication in GROTOAP2 to obtain its journal, and then map from each journal to the corresponding publisher.
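This linking step amounts to a two-stage lookup. The sketch below is purely illustrative: the table entries and the `title_to_journal` helper are hypothetical stand-ins for the Semantic Scholar lookup and the actual journal-to-publisher table used in the paper.

```python
# Hypothetical journal-to-publisher entries, for illustration only.
# In the paper, the journal for each title is recovered by querying
# the Semantic Scholar database.
JOURNAL_TO_PUBLISHER = {
    "BMC Bioinformatics": "BMC",
    "PLOS ONE": "PLOS",
    "Acta Crystallographica Section E": "ACTA",
}

def publisher_for_paper(title, title_to_journal):
    """Map a paper title to its publisher via its journal.

    title_to_journal is a dict standing in for the Semantic Scholar
    title -> journal lookup; returns None when the chain breaks.
    """
    journal = title_to_journal.get(title)
    if journal is None:
        return None
    return JOURNAL_TO_PUBLISHER.get(journal)
```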

The GROTOAP2-LDS Benchmark
We use the procedure described in Section 3.2 to construct GROTOAP2-LDS, a set of train-test splits that evaluate model performance under different training conditions.

Test splits
We construct four test sets, each of which contains papers from a held-out publisher (ACTA, BMC, PLOS, RU). GROTOAP2 contains a large number of papers from each of these publishers (at least 300 papers per publisher, or about 1 million words), ensuring enough data to compute reliable estimates of in-distribution and out-of-distribution performance. (Other publishers also met the minimum number of papers/words, e.g., Nucleic Acids Research; we focused on four publishers to keep the number of experiments tractable.) Each of these four publishers contains papers from a qualitatively distinct layout distribution (Figure 2). The same four test sets are used to evaluate models under each training condition. For each publisher, 20% of papers were used as a held-out test set, and the remaining 80% of papers were used in certain training conditions (e.g., to compute an estimate of in-distribution performance). The test sets contain an average of 75 papers (≈500,000 words) each; statistics are summarized in Table 1.
In-Distribution (ID) Training For each of the four held-out test publishers, we construct a training set with papers from the same publisher. Papers from the same publisher exhibit different layouts, but layout differences between papers within the same publisher are small relative to differences between papers from different publishers. We therefore refer to settings in which models are trained and tested on papers from the same publisher as the "in-distribution" setting, and settings involving transfer across publishers as the "out-of-distribution" setting. Model performance in this setting is used to estimate the performance drop between in-distribution and out-of-distribution layouts.
Out-of-Distribution (OOD) Training We construct training sets that evaluate model performance under layout distribution shift. The number of training papers is matched between training sets; each training set contains roughly 2,000 papers (≈10,000,000 words). We construct training sets reflecting different levels of layout diversity. Our default training approach ("LIMITEDPUBLISHER") is a leave-one-publisher-out setting in which each model is trained on three publishers and tested on the held-out fourth publisher. To evaluate the impact of training set diversity on robustness to layout distribution shifts, we construct datasets with 25 publishers ("LIMITEDPUBLISHER+") or 125 publishers ("LIMITEDPUBLISHER++").
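The leave-one-publisher-out construction can be sketched as follows; the paper dictionaries and field names here are hypothetical illustrations of the split logic, not the actual data format.

```python
def leave_one_publisher_out(papers, held_out):
    """Split papers (dicts with a 'publisher' field) into an
    out-of-distribution training pool and a held-out test pool,
    as in the LIMITEDPUBLISHER setting."""
    train = [p for p in papers if p["publisher"] != held_out]
    test = [p for p in papers if p["publisher"] == held_out]
    return train, test

# One fold per held-out publisher: train on the other three, test on it.
publishers = ["ACTA", "BMC", "PLOS", "RU"]
papers = [{"id": i, "publisher": publishers[i % 4]} for i in range(8)]
folds = {pub: leave_one_publisher_out(papers, pub) for pub in publishers}
```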
To quantify the diversity of spatial configurations in each training set, we measure the breadth (B) of spatial locations covered by each structural category. To compute B, for each structural category we count the proportion of spatial x-y positions where that category occurs, and then compute the mean across categories. The value of B for each data split is included in Table 2.
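The breadth statistic B can be computed roughly as below. The grid resolution and the page/word representation are assumptions for illustration; the paper does not specify how x-y positions are discretized.

```python
def breadth(pages, grid=(10, 10)):
    """Mean, over structural categories, of the fraction of grid cells
    in which that category's words appear at least once.

    Assumptions (not from the paper): pages are dicts with 'width',
    'height', and 'words'; each word has a 'bbox' and a 'label';
    positions are discretized onto a 10x10 grid.
    """
    cells = {}  # category -> set of occupied (cx, cy) grid cells
    for page in pages:
        w, h = page["width"], page["height"]
        for word in page["words"]:
            x0, y0, x1, y1 = word["bbox"]
            # assign each word to the grid cell containing its center
            cx = min(int((x0 + x1) / 2 / w * grid[0]), grid[0] - 1)
            cy = min(int((y0 + y1) / 2 / h * grid[1]), grid[1] - 1)
            cells.setdefault(word["label"], set()).add((cx, cy))
    total = grid[0] * grid[1]
    props = [len(s) / total for s in cells.values()]
    return sum(props) / len(props)
```

A training set whose categories appear in many distinct page regions yields a higher B than one where each category always occupies the same spot.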
Few-shot Adaptation In practice, it may be possible to cheaply annotate a few papers from a new layout distribution (e.g., when a trained model is applied to papers from a new publisher). To test how quickly models can adapt to a new layout distribution, we additionally evaluate models in settings in which models are first trained on an out-of-distribution training set, and are then fine-tuned on a small amount of in-distribution data. Specifically, before testing models on each of the test sets, we perform few-shot fine-tuning with a few annotated examples (10 papers, ≈50,000 words) from the held-out test publisher.
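Sampling one such adaptation episode can be sketched as follows; the 8/2 train/dev split matches the implementation details in Section 4, while the function name and paper representation are hypothetical.

```python
import random

def sample_episode(publisher_papers, n_train=8, n_dev=2, seed=0):
    """Sample one few-shot adaptation episode: 10 annotated papers from
    the held-out publisher, split 8 train / 2 dev as described in the
    implementation details."""
    rng = random.Random(seed)
    picked = rng.sample(publisher_papers, n_train + n_dev)
    return picked[:n_train], picked[n_train:]
```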

Models
We evaluate on BERT, LayoutLM, LayoutLMv2, LayoutLMv3, and SciBERT (we use the base uncased version of each model). The three layout-infused models (LayoutLM, LayoutLMv2, LayoutLMv3) share the same model size and underlying architecture as BERT (Devlin et al., 2019). The equivalence in model size facilitates direct comparisons between different methods of incorporating layout features. Each layout-infused model is adapted to use layout features such as text position or page image embeddings on top of the standard BERT architecture. These layout-infused models have previously been shown to achieve state-of-the-art performance for processing visually-rich text documents with in-distribution layouts (Huang et al., 2022b; Shen et al., 2022). We briefly describe these layout-infused models, and defer to the original papers for more specific details about model architecture and training.
LayoutLM: LayoutLM is initialized from BERT, and then adapted to incorporate information about spatial text position. Masked visual-language modeling and multi-label document classification are used to adapt the model to incorporate the layout-specific components.
LayoutLMv2: LayoutLMv2 is initialized from BERT, and then adapted to incorporate spatial text position as well as image embeddings of page regions. Masked visual-language modeling and text-image alignment are used to adapt the model to incorporate the layout-specific components.
LayoutLMv3 (Huang et al., 2022b): LayoutLMv3 is initialized from RoBERTa, and then adapted to incorporate spatial text position as well as image embeddings of page patches. Masked language modeling, masked image modeling, and word-patch alignment are used to adapt the model to incorporate the layout-specific components.
We additionally evaluate on SciBERT (Beltagy et al., 2019), which is pretrained with the same pretraining tasks as BERT, but instead with data from scientific texts. SciBERT allows us to compare the benefit of layout-infusion with the benefit of simply using a model pretrained on in-domain text.
I-VILA tokens, which provide a textual indication of visual group boundaries as part of model input, have been shown to improve performance on document structure recovery (Shen et al., 2022).
Our preliminary experiments showed that I-VILA tokens improve performance across all experimental settings. Therefore for all reported experiments, we use block-level I-VILA tokens provided by Shen et al. (2022).

Implementation Details
We implemented experiments in PyTorch, using the transformers library to access pretrained models (Paszke et al., 2019; Wolf et al., 2020). The learning rate for each model was selected by training each model with a learning rate of 1e-04, 1e-05, and 1e-06, and selecting the learning rate with the best dev set performance. This learning rate sweep was done separately for the initial training phase and for few-shot fine-tuning (see Appendix for details). The initial training stage included a linear warmup schedule with 2000 steps. The AdamW optimizer with β1 = 0.9, β2 = 0.999 was used during training. During each episode of few-shot fine-tuning, eight papers were used as the train set and two papers were used as the dev set. No warmup was used during few-shot fine-tuning. A batch size of four was used throughout training. Models were trained for a maximum of 10 epochs during the initial training phase and a maximum of 250 epochs during few-shot fine-tuning. Dev set performance was used for early stopping. For each model, the initial training phase was performed over three random seeds. For each random seed, few-shot fine-tuning was performed over three different episodes. Tables in Section 5 show the mean and standard deviation over random seeds and episodes. Full experimental results are included in the Appendix.

Results
We use GROTOAP2-LDS (Section 3.3) to evaluate robustness to layout distribution shifts. For each experimental condition, models are evaluated on four test sets, each containing papers from a heldout layout distribution. Unless indicated otherwise, model performance is reported as the average across these four test sets.

Layout-infused LMs Perform Best on In-Distribution Layouts
In-distribution performance of each model is shown in Table 3. Consistent with prior work, we find that layout-infused LMs reach the highest performance for documents with in-distribution layouts. In subsequent sections, we use ∆ ID to refer to the difference between model performance on this in-distribution training condition and on out-of-distribution training conditions.

Models Overfit to Layout Distributions Seen During Training
To evaluate model robustness to layout distribution shifts, we train models on papers from three publishers (LIMITEDPUBLISHER), and then test on papers from a held-out test publisher. Model performance for each of the test sets is shown in Table 4. Compared to in-distribution performance (Table 3), out-of-distribution performance drops between 15.38 and 20.22 F1 (∆ ID ). Layout-infused models perform worse than SciBERT, a model not pretrained with layout-specific components. Although layout-infused models achieve the highest performance for in-distribution layouts, these models overfit to layout distributions seen during training. In settings in which models need to generalize to out-of-distribution layouts, models with in-distribution text pretraining (as with SciBERT) may be more effective.

Models Can Quickly Adapt to Layout Distribution Shifts
In practice, it may sometimes be possible to cheaply annotate a few papers from a target distribution (e.g., when a system ingests papers from a new publisher). To test how well models can quickly adapt to a new layout distribution, we first train models on out-of-distribution training splits, then perform few-shot fine-tuning with ten papers from the held-out test publisher, and then evaluate on the test set for that publisher. Table 5 shows model performance in this setting. From the in-distribution to out-of-distribution settings, model performance drops between 3.3 and 4.3 F1 (∆ ID). Although model performance falls substantially below in-distribution performance, few-shot adaptation to the target distribution reduces the performance drop by over 80% compared to settings in which models must directly generalize to the new distribution (Table 4). After few-shot adaptation, LayoutLMv2 achieves the highest out-of-distribution test performance, suggesting that layout-infusion may help models adapt more quickly to new layout distributions.

Increasing Layout Diversity Observed During Training Can Improve Robustness
To determine whether layout-diverse training can improve model robustness, we train models on papers from more publishers while holding the total number of papers constant (the LIMITEDPUBLISHER+ and LIMITEDPUBLISHER++ training sets described in Section 3.3). Model performance for each training diversity condition is shown in Figure 3. Performance is shown separately for settings in which models must generalize directly to papers from a different layout distribution (as in Section 5.2), and for settings in which models are fine-tuned on a few annotated examples from the target distribution (as in Section 5.3). When models must generalize directly to papers from a different layout distribution, a change from training on LIMITEDPUBLISHER to LIMITEDPUBLISHER+ increases test performance on out-of-distribution layouts by a mean of 9.91 F1 over models. A further increase in diversity from LIMITEDPUBLISHER+ to LIMITEDPUBLISHER++ increases performance by an additional 0.28 F1. In settings where models receive a few annotated examples to adapt to the target distribution (e.g., Section 5.3), training on LIMITEDPUBLISHER+ rather than LIMITEDPUBLISHER yields a much smaller performance gain (1.53 F1). In few-shot adaptation settings, a further increase from training on LIMITEDPUBLISHER+ to LIMITEDPUBLISHER++ results in a 0.19 F1 drop in performance.

Figure 3: Out-of-distribution performance vs. training diversity. Test macro-F1 on document structure recovery. LP=LIMITEDPUBLISHER. Error bars reflect standard deviation over trials. (1) Increasing training diversity improves robustness to layout distribution shifts, but even the highest training diversity condition does not reach ID performance. (2) Increasing training diversity provides diminishing benefits. (3) Benefits of training diversity overlap with benefits from few-shot adaptation.
These results suggest that increasing the diversity of layouts observed during training can improve model robustness, but that this strategy provides diminishing returns as training diversity continues to increase. Furthermore, the benefits of increasing training diversity may largely overlap with the benefits of few-shot adaptation to the target distribution. Even in the most favorable out-of-distribution setting, in which models are trained on the most beneficial training diversity condition and then fine-tuned on a few papers from the target layout distribution, model performance is at least 2 F1 below in-distribution performance.

Error Analysis
In practice, shifts in layout and text distribution are highly correlated. For instance, papers written for different scientific communities differ in both textual content and visual layout. To understand whether performance drops are driven by changes in layout, we analyzed model performance in the most difficult generalization setting (LIMITEDPUBLISHER with no few-shot adaptation). We examined whether generalization errors typically occurred for categories for which layout changes the most. Figure 4 shows the performance drop between in-distribution and out-of-distribution settings for each structural category. Categories with the largest performance drops are those which are often characterized by spatial location, such as PAGE_NUM (-51.9 F1), BIBLIOGRAPHIC_INFO (-25.0 F1), and ACKNOWLEDGEMENTS (-22.7 F1). In contrast, much smaller performance drops occurred in categories containing the main textual content of the paper, such as BODY (-9.2 F1) and ABSTRACT (-14.1 F1).

Conclusion
This work studies whether layout-infused models are robust to layout distribution shift. We present a method for evaluating robustness to layout distribution shift, and construct GROTOAP2-LDS, a new set of splits for the GROTOAP2 dataset that evaluate model robustness to layout distribution shifts. We use GROTOAP2-LDS to evaluate a set of existing layout-infused models (LayoutLM, Lay-outLMv2, and LayoutLMv3), and compare against two text-only models (BERT, SciBERT).
Layout-infused models perform most accurately on documents with familiar layouts (Table 3), but in settings where models must generalize to documents with unfamiliar layouts, layout-infused models underperform text-only models such as SciBERT (Table 4). In such settings, models with in-domain text pretraining both provide more accurate results and obviate the inference-time cost of processing visual layout features (e.g., the image embeddings in LayoutLMv3 increase inference time by ≈10×).

Figure 4: Performance drop per category, low-diversity training. Performance drop (∆ ID) between in-distribution and out-of-distribution layouts for each structural category. The largest performance drops occur for categories characterized by spatial page location (e.g., PAGE_NUM, BIBLIOGRAPHIC_INFO). In contrast, much smaller performance drops occurred in categories that contain the main textual content of the paper (e.g., BODY, ABSTRACT).
We hypothesize that layout-specific components overfit more because they receive less pretraining data compared to text-only components, or because they increase total model parameter count (e.g., LayoutLM, LayoutLMv2, and LayoutLMv3 contain 20-45% more parameters than BERT and SciBERT). Future work could test whether larger-scale pretraining improves robustness of layout-specific components.
We show that training strategies such as increasing training diversity or few-shot adaptation to the target layout distribution can mitigate the performance drop across layout distribution shifts. These results provide guidance for curating training data and highlight the importance during data collection of curating examples that reflect variation in document provenance. In situations with a known change in layout distribution (e.g., if a system trained on papers from one publisher is re-used to process papers from a new publisher), the cost of annotating a few examples from the target distribution may be highly effective, resulting in a large improvement in out-of-distribution model performance.
This work highlights the importance of considering layout distribution shifts when evaluating models on tasks involving visually-rich documents such as scientific papers. We hope that our study and evaluation methodology facilitate the development of layout-infused models that can generalize across layout distribution shifts.

Limitations
We use scientific papers as a first testbed for evaluating model robustness to layout distribution shifts. Many different layouts exist among scientific papers, and the existence of metadata databases facilitated the construction of train-test splits with layout distribution shifts. However, scientific papers are only one domain in which layout distribution shifts occur. Layouts also vary for many other visuallyrich documents, such as business forms, receipts, webpages, and newspapers. We hope our evaluation methodology engenders evaluations on a wider range of document types.
Our experiments involve a subset of the many layout-infused models proposed in recent work (e.g., Peng et al., 2022;Kim et al., 2021;Li et al., 2021). The models in our experiments were chosen because they share a similar model size and underlying architecture, facilitating comparisons between different methods of layout-infusion. We release our evaluation suite to enable more comprehensive evaluations in the future.
Performance drops occur both for layout-infused and, to a lesser extent, text-only models. The performance drops from text-only models may be due to layout information conveyed via word order and visual section boundary markers, but may also reflect shifts in text distribution. Our error analyses suggest that generalization errors are driven by shifts in layout rather than content (Section 5.5). In the future, synthetic experiments (e.g., with LaTeX-based manipulations) would help to fully disentangle the effects of layout and content distribution shifts, provided that large-scale synthetic manipulations can be constrained to produce realistic layouts.

Potential Risks
Although we do not foresee direct harms from this work, our work is related to automated processing of scientific documents. This line of study carries the risk of inaccurately processing documents and propagating false information about scientific findings.

Acknowledgements
This work was supported in part by NSF Grant 2033558. CC was supported in part by an IBM PhD Fellowship.

Appendix Experiment Compute Details
All experiments were run on NVIDIA RTX A6000 GPUs. For each training condition, model training took around one day for LayoutLMv2 and LayoutLMv3, and took a few hours for the other three models (LayoutLM, BERT, and SciBERT).

Learning Rate Selection
To select the learning rate for each model, models were trained on each of three learning rates (1e-04, 1e-05, 1e-06), and the learning rate that produced the best performance on the dev set was selected. Learning rate selection was performed separately for each of the training stages: the initial stage of training on the larger set of out-of-distribution layouts, and few-shot fine-tuning on examples from the target distribution. Dev performance for each model is shown for the initial training stage (Table 6) and for few-shot fine-tuning.
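The sweep amounts to an argmax over candidate rates, as sketched below; `train_fn` and `eval_fn` are hypothetical hooks standing in for the actual training run and dev-set evaluation.

```python
def select_learning_rate(train_fn, eval_fn, candidates=(1e-4, 1e-5, 1e-6)):
    """Train once per candidate rate and keep the rate with the best
    dev-set score, mirroring the sweep described above.

    train_fn(lr) -> model and eval_fn(model) -> dev macro-F1 are
    assumed interfaces, not part of the released code.
    """
    best_lr, best_score = None, float("-inf")
    for lr in candidates:
        model = train_fn(lr)    # one full training run at this rate
        score = eval_fn(model)  # dev-set performance
        if score > best_score:
            best_lr, best_score = lr, score
    return best_lr
```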

Dataset Details
Our set of new train-test splits, GROTOAP2-LDS, is an adaptation of data released in the GROTOAP2 dataset. GROTOAP2 is distributed under the CC-BY license. Further use of GROTOAP2-LDS should attribute the original dataset collection to Tkaczyk et al. (2014). We refer the reader to Tkaczyk et al. (2014) for details of the original data collection procedure.

Label remapping
Although a shared annotation procedure was used to label all papers in the GROTOAP2 dataset (Tkaczyk et al., 2014), differences in XML formatting for different publishers resulted in discrepancies between the structural category labels used for different publishers. For instance, in the original GROTOAP2 dataset, TITLE_AUTHOR labels are used for some publishers, whereas separate TITLE and AUTHOR labels are used for other publishers.
To account for minor annotation discrepancies between publishers, as well as insufficient support for certain category labels in our dataset splits, we re-map the structural category tagset used in the original GROTOAP2 dataset. Our label re-mapping is shown in Table 8.

Full results
We provide the test performance for each trial and episode in Tables 9-15.

Responsible NLP Checklist

A2. Did you discuss any potential risks of your work?

Section 8
A3. Do the abstract and introduction summarize the paper's main claims?
Section 1

A4. Have you used AI writing assistants when working on this paper?
Left blank.
B Did you use or create scientific artifacts?
We present a new set of train-set splits for Grotoap2. We describe these splits in Section 3.3.

B1. Did you cite the creators of artifacts you used?
Yes, we cite Grotoap2 throughout the paper, including in Section 3 (Evaluation Methodology).
B2. Did you discuss the license or terms for use and / or distribution of any artifacts?
Yes, we discuss the Grotoap2 license in the Appendix.
B3. Did you discuss if your use of existing artifact(s) was consistent with their intended use, provided that it was specified? For the artifacts you create, do you specify intended use and whether that is compatible with the original access conditions (in particular, derivatives of data accessed for research purposes should not be used outside of research contexts)?
Yes, we discuss the Grotoap2 license in the Appendix.
B4. Did you discuss the steps taken to check whether the data that was collected / used contains any information that names or uniquely identifies individual people or offensive content, and the steps taken to protect / anonymize it?
The examples in Grotoap2 are from scientific papers in the public domain; we did not perform further anonymization steps.
B5. Did you provide documentation of the artifacts, e.g., coverage of domains, languages, and linguistic phenomena, demographic groups represented, etc.?
We did not provide documentation of the original Grotoap2 dataset, but we refer the reader to Tkaczyk et al., 2014 for details of the original data collection procedure.
B6. Did you report relevant statistics like the number of examples, details of train / test / dev splits, etc. for the data that you used / created? Even for commonly-used benchmark datasets, include the number of examples in train / validation / test splits, as these provide necessary context for a reader to understand experimental results. For example, small differences in accuracy on large test sets may be significant, while on small test sets they may not be.
We describe statistics of train-test splits in Section 3.3.

C Did you run computational experiments?
Section 4, Section 5

C1. Did you report the number of parameters in the models used, the total computational budget (e.g., GPU hours), and computing infrastructure used?
We report the computational cost and infrastructure in the Appendix.
The Responsible NLP Checklist used at ACL 2023 is adopted from NAACL 2022, with the addition of a question on AI writing assistance.