Summaries as Captions: Generating Figure Captions for Scientific Documents with Automated Text Summarization

Good figure captions help paper readers understand complex scientific figures. Unfortunately, even published papers often have poorly written captions. Automatic caption generation could aid paper writers by providing good starting captions that can be refined for better quality. Prior work often treated figure caption generation as a vision-to-language task. In this paper, we show that it can be more effectively tackled as a text summarization task in scientific documents. We fine-tuned PEGASUS, a pre-trained abstractive summarization model, to specifically summarize figure-referencing paragraphs (e.g., “Figure 3 shows...”) into figure captions. Experiments on large-scale arXiv figures show that our method outperforms prior vision methods in both automatic and human evaluations. We further conducted an in-depth investigation focused on two key challenges: (i) the common presence of low-quality author-written captions and (ii) the lack of clear standards for good captions. Our code and data are available at: https://github.com/Crowd-AI-Lab/Generating-Figure-Captions-as-a-Text-Summarization-Task.


Introduction
In scientific documents, effective figure captions help readers understand complex figures like bar charts, line charts, or pie charts. These captions describe the images and often include necessary context from the document's full text (Durbin, 2004). Unfortunately, even published papers often have poorly written captions. In our analysis (Section 8.2), 53.88% of the line-chart captions in arXiv cs.CL papers were found to be unhelpful to NLP readers. Automatic caption generation could aid paper writers by providing good starting captions that can be refined for better quality.
Previous research typically approached figure caption generation as a vision-to-language task, i.e., creating captions based on the figure image. For instance, Hsu et al. (2021) used an end-to-end approach with a CNN+RNN architecture, which extracted a feature representation from the image and converted it into caption text. Qian et al. (2021) took a slightly different approach: first understanding what is in the image, pulling out key information, and then using a preset template to create the caption. However, despite achieving some success on synthetic data (Kahou et al., 2017; Kafle et al., 2018; Chen et al., 2020a; Zhu et al., 2021), these approaches often struggled to caption real-world figures. For example, Hsu et al. (2021)'s end-to-end approach, trained and tested on arXiv figures, achieved a BLEU-4 score of only 2.91.
In this paper, we argue that figure captioning in scientific documents can be more effectively tackled as a text-summarization task: the caption can be generated by summarizing the paragraphs that mention the figure (as shown in Figure 1). Scientific figures typically come with extensive text in the document that can aid caption generation. Our analysis (Section 5) shows that, in arXiv, over 75% of the words in figure captions can be aligned with words in the paragraphs referencing those figures, which motivates our approach. Automatic evaluation shows that summarizing the figure-referencing paragraphs results in better captions than prior vision-based methods. In a human evaluation by external domain experts, our best-performing model's captions were preferred over the original captions 46.67% of the time.
We further conducted an in-depth investigation focused on two key challenges: (i) the common presence of low-quality author-written captions and (ii) the lack of clear standards for good captions. Surprisingly, 53.88% of the author-written captions in our sample were deemed unhelpful. This has implications for the design of future captioning systems, underscoring the influence of data quality on captioning performance.
Figure 1: Overview example. The caption is generated by the model Pegasus P +O+B . The example shown in this figure is extracted from the paper (Doulaty et al., 2015).

Related Work
Prior figure captioning works can be broadly categorized into two approaches: caption generation (i) based on the image of the figure or (ii) based on the data chart underlying the figure.
Earlier image-based approaches focused on automated image understanding, which involved parsing images to extract the figure's key attributes and converting parsed data into captions, e.g., using predefined templates (Kahou et al., 2017;Kafle et al., 2018;Methani et al., 2020;Qian et al., 2021;Siegel et al., 2016). Recently, with the advance of deep learning, more works are adopting an end-to-end paradigm, generating captions straight from the neural representations of images (Mahinpei et al., 2022;Pelka et al., 2021;Hsu et al., 2021;Chen et al., 2019;Kantharaj et al., 2022;Chen et al., 2020a). Our work contrasts with prior studies by focusing on text to generate captions instead of visuals. To the best of our knowledge, no existing figure-caption datasets explicitly contain the figures' accompanying documents (Pelka et al., 2021;Hsu et al., 2021;Chen et al., 2019), as this task has generally been approached as a vision task. Most recently, a knowledge-augmented image captioning method that uses both image and text data was introduced (Yang et al., 2023), suggesting the potential of using text from documents. Some approaches generate captions using the underlying tabular data of a figure rather than the figure's image. Earlier approaches often employed rule-based techniques (Corio and Lapalme, 1999;Demir et al., 2008;Fasciano and Lapalme, 1996;Mittal et al., 1998), while newer ones favor learning-based methods (Barzilay and Lapata, 2005;Wiseman et al., 2017;Moraes et al., 2014;Zhu et al., 2021;Kantharaj et al., 2022;Obeid and Hoque, 2020;Reiter et al., 2005;Parikh et al., 2020;Chen et al., 2020b;Gong et al., 2019;Su et al., 2021;Chen et al., 2020c). Despite these approaches' ability to utilize tabular and meta data, they necessitate access to the figure's raw data. Contrarily, our work uses the rich textual information in scientific documents to generate captions.

Problem Statement and Terminology
A document D contains n figures, F 1 to F n , where F i has a caption C i that was written by the document author. In document D, j sentences, M i,1 to M i,j , explicitly mention F i (e.g., "As shown in F i ..."). The objective of this work is to automatically generate a high-quality caption, C ′ i , for figure F i using only its mentions (M i,1 to M i,j ) and the surrounding text of the mentions in document D.
In the rest of the paper, we use these terms: • A "Mention" refers to a sentence in a document that explicitly mentions the target figure, e.g., "As shown in Figure 6..." If there are multiple Mentions, the first Mention is referred to.
• A "Paragraph" refers to a section of text containing a Mention. In this work, the boundaries of a Paragraph are determined by the <p> tag produced by PDF parsing.
• Sentences near a Mention may contain relevant information, so we extracted n preceding sentences and m following sentences to form the "Window[n, m]" text snippet. For instance, "Window[1, 2]" refers to a snippet of four sentences, including one preceding sentence, the Mention sentence, and two following sentences. A minimal construction sketch is given after this list.
• An "OCR" refers to the textual information (e.g., legends, labels, etc.) extracted from the image, by optical character recognition (OCR) software.

Dataset
Before diving into our experiments and analyses, we first describe the dataset upon which our study is grounded. Our results are based on a scientific figure caption dataset, SCICAP, and several preprocessing steps to fit it into our workflow. SCICAP is a dataset that contains over 416,000 line charts and captions extracted from more than 290,000 arXiv papers (Hsu et al., 2021). It was one of the first large-scale figure-captioning datasets based on real-world scientific figures. However, it does not contain the paragraphs that mention each figure. To address this, we downloaded all the PDF files of the original arXiv papers used in SCICAP and extracted all the Mentions and Paragraphs as outlined in Section 6.1. Detailed information on preprocessing, including the dataset resplit and OCR extraction, is described in Appendix B.

Motivating Analysis
To understand the correlation between mentions and captions, we performed a series of analyses using the data described in Section 4. Specifically, we investigated the extent to which the words in figure captions are represented in the corresponding figure-mentioning paragraphs. We used awesome-align (Dou and Neubig, 2021) to obtain alignments between the source texts (mentions, paragraphs, and OCRs) and the captions. awesome-align compared the similarity of the words' contextual embeddings and assigned an alignment between words if the similarity passed a threshold. We used SciBERT (Beltagy et al., 2019) to obtain contextual embeddings and a softmax threshold of 0.99 to reduce false alignments.
After obtaining the alignments, we computed what percentage of the information in the caption could be found in the source texts. The results shown in Table 1 indicate that 76.68% of the caption's information could be found in the Paragraph and OCR, motivating us to generate figure captions by summarizing the Paragraph. We also observed that a randomly selected sentence and paragraph from the same paper can cover 35.23% and 44.43% of the caption, respectively, showing that there is some generic information-sharing across the paper. We also conducted a study using exact n-gram overlap (i.e., BLEU scores) in Appendix A.
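As a companion to this analysis, the following is a simplified sketch of the coverage computation, assuming SciBERT token embeddings and a bidirectional softmax-threshold alignment in the spirit of awesome-align; it illustrates the idea rather than reproducing the exact awesome-align implementation, and the example texts are made up.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
enc = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

def embed(text: str) -> torch.Tensor:
    """Contextual embeddings for each (non-special) subword token."""
    inputs = tok(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = enc(**inputs).last_hidden_state[0]
    return hidden[1:-1]  # drop [CLS] and [SEP]

def caption_coverage(source: str, caption: str, threshold: float = 0.99) -> float:
    """Fraction of caption tokens aligned to at least one source token.
    A token pair counts as aligned when it passes the softmax threshold
    in both the source-to-caption and caption-to-source directions."""
    s, c = embed(source), embed(caption)
    sim = s @ c.T                       # |source| x |caption| similarity matrix
    s2c = torch.softmax(sim, dim=1)     # source -> caption
    c2s = torch.softmax(sim, dim=0)     # caption -> source
    aligned = (s2c > threshold) & (c2s > threshold)
    return aligned.any(dim=0).float().mean().item()

print(caption_coverage(
    "Figure 3 shows that accuracy drops as the noise level increases.",
    "Accuracy versus noise level."))
```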

Generating Figure Captions as a Text Summarization Task
Figure 1 overviews the proposed pipeline. This section describes each step of the pipeline.

Extracting Mentions and Paragraphs
The system first extracts Mentions and their associated Paragraphs (as defined in Section 3). In this paper, we used Grobid (kermitt2, 2022), a publicly available tool for converting PDF files into structured XML documents, to extract plain text from the paragraphs (including the <p> tags) of each paper. This plain text was then segmented into sentences using BlingFire (microsoft, 2022). We developed regular expressions to identify sentences mentioning specific figures. For instance, sentences such as "As shown in Figure 6, ..." were first identified and then linked to Figure 6. To assess the performance of these regular expressions, we conducted a manual evaluation of 300 samples from our experimental dataset. The results showed high precision (99.58%) and recall (94.44%).
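To make this step concrete, below is a minimal sketch of such a figure-mention matcher; the pattern and helper function are illustrative assumptions, not the exact regular expressions used in our pipeline, which handle more variants.

```python
import re
from typing import List, Tuple

# Matches "Figure 6", "Fig. 6", "figure 6(a)", etc., and captures the figure index.
FIGURE_REF = re.compile(r"\b[Ff]ig(?:ure)?\.?\s*(\d+)")

def find_mentions(sentences: List[str]) -> List[Tuple[int, str]]:
    """Return (figure_index, sentence) pairs for sentences that mention a figure."""
    mentions = []
    for sent in sentences:
        for match in FIGURE_REF.finditer(sent):
            mentions.append((int(match.group(1)), sent))
    return mentions

sents = ["As shown in Figure 6, the loss decreases steadily.",
         "We describe the dataset next."]
print(find_mentions(sents))  # [(6, 'As shown in Figure 6, the loss decreases steadily.')]
```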

Generating Captions Using Text Summarization Models
As shown in Figure 1, our system then automatically summarizes all the extracted Mentions (or Paragraphs) into a figure caption. In this work, we used PEGASUS, an abstractive summarization model (Zhang et al., 2020), and fine-tuned it on our dataset. Five Pegasus models, Pegasus M , Pegasus P , Pegasus O , Pegasus M +O , and Pegasus P +O , were trained using five distinct input combinations: (i) Mention, (ii) Paragraph, (iii) the OCR output of the target figure image, (iv) Mention+OCR, and (v) Paragraph+OCR. Pegasus P +O covers the most relevant information in the document and is thus expected to yield the best summaries. Additionally, we built Pegasus P +O+B , a version of the model trained on a subset of higher-quality captions, (vi) Paragraph+OCR-Better. Given the absence of reliable automated ways to assess caption quality, we followed the guideline from previous studies that longer captions enhance reader comprehension (Hartley, 2003; Gelman et al., 2002): since the average caption length was 26.8 tokens, we set the threshold at 30 tokens and trained this model only on captions with 30 or more tokens, using Paragraph+OCR inputs.
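A minimal sketch of how the training pairs for these input variants and the Better subset could be assembled is shown below; the field names and the whitespace tokenization are illustrative assumptions rather than our exact preprocessing.

```python
from typing import Dict, List

def build_input(example: Dict, variant: str) -> str:
    """Concatenate the source texts for one of the input variants
    (M, P, O, M+O, P+O); fields are assumed to be pre-extracted strings."""
    parts = {
        "M": [example["mention"]],
        "P": [example["paragraph"]],
        "O": [example["ocr"]],
        "M+O": [example["mention"], example["ocr"]],
        "P+O": [example["paragraph"], example["ocr"]],
    }[variant]
    return " ".join(p for p in parts if p)

def better_subset(examples: List[Dict], min_tokens: int = 30) -> List[Dict]:
    """Keep only examples whose caption has at least `min_tokens` tokens,
    the length heuristic behind the Paragraph+OCR-Better training subset."""
    return [ex for ex in examples if len(ex["caption"].split()) >= min_tokens]
```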
We identified two major challenges in generating captions for scientific figures in real-world scenarios. We discuss these challenges in the following subsections, with an in-depth analysis in Section 8.

Challenge 1: Addressing Unreliable Quality of Real-World Data
Low-quality captions are common in scholarly articles. Our analysis (see Section 8.1) showed that more than 50% of the line charts' author-written captions in arXiv cs.CL papers were deemed unhelpful by domain experts. The consequence of this unreliable data quality is that developers could end up training and testing captioning models on unhelpful captions. The lack of automatic methods for evaluating caption quality makes it hard to identify suitable training examples and eliminate poor ones. To address this issue, we included Pegasus P +O+B , trained only on longer captions, which the literature suggests are more helpful to readers (Hartley, 2003; Gelman et al., 2002). To account for low-quality test data, we conducted both human and automatic evaluations. The data quality of figure captions is analyzed in Section 8.2.

Challenge 2: Defining a Clear Standard for "Good" Figure Captions
The deeper issue is the lack of a set of well-defined and actionable criteria for determining the usefulness of a figure caption. Although there are guidelines for writing effective scientific figure captions (Rougier et al., 2014;Biegel and Kamat, 2019), their translation into algorithmic models can be challenging. From a modeling standpoint, the lack of a clear goal presents a challenge, as it is uncertain what to optimize for once fluency has been achieved. In this paper, we focus on demonstrating the feasibility of generating captions via text summarization. Although we did not incorporate specialized goals in the model, we examine the criteria for a "good" caption in Section 8.2.

Experimental Results
A Simple Baseline: Using Extracted Mentions as Captions. Motivated by our information overlap study (Section 5), we created the Reuse baselines. These baselines simply repurpose portions of the input text as the prediction.
Vision-to-Language Baselines. The vision-to-language baselines treated this task as an image-captioning task that took the scientific figure image as input and generated text to describe it. We compared two vision-to-language models as baselines. First, we built a sequence-to-sequence model by combining BEiT (Bao et al., 2022) and GPT-2 (Radford et al., 2019). We also selected TrOCR as the second vision-to-language baseline.
Figure 2: The relationship between average generated text length and ROUGE-2. For example, when the predicted text is shorter than 50 tokens, predicting longer texts generally results in a higher ROUGE-2 score. The normalized scores indicate the proposed system's performance gain over the random baseline of the same length. Pegasus P +O+B and Reuse M get closer to TrOCR after normalization, suggesting the need for normalization for accurate interpretation of results.
Automatic Evaluation Metrics. We used ROUGE (Lin, 2004; Nallapati et al., 2016), MoverScore, and BERTScore for automatic evaluation. When computing ROUGE scores using rouge-score (google-research, 2022), we lowercased all text and stemmed words. As both MoverScore and BERTScore are based on semantic similarity, we obtained contextual embeddings from SciBERT (Beltagy et al., 2019).
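As an illustration, a minimal ROUGE computation with the rouge-score package might look like the following; lowercasing is applied explicitly and stemming via the scorer's built-in option, and the example texts are made up.

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference = "Accuracy of the proposed model versus the number of training examples."
prediction = "Figure shows accuracy against the number of training examples."

# Lowercase both sides before scoring, as in our evaluation setup.
scores = scorer.score(reference.lower(), prediction.lower())
print(scores["rouge2"].fmeasure)
```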
Automatic Evaluation with Normalization Over Caption Length. ROUGE F1 tends to favor longer texts up to a certain length, leading to skewed comparisons where models generating longer texts receive higher scores (Sun et al., 2019). We followed Sun et al. (2019)'s approach of normalizing the scores against a random baseline that generates texts of the same length.
Specifically, each system's score is normalized against Random(length), where length is the average length of the texts generated by the target system. We estimated Random(length) by applying linear interpolation over several (length, random score) pairs. The (length, random score) pairs were obtained by randomly selecting a certain number of sentences (1, 2, ..., 10 sentences) from the input paragraph as the prediction. To get random scores for texts shorter than a single sentence (around 30 tokens), we truncated sentences to the desired length (4, 6, ..., 30 tokens). For each length setting, we ran 10 different random seeds and reported the average; the Random line in Figure 2 shows these estimated scores. Table 2 shows the normalized automatic evaluation results. Overall, Pegasus P +O , the text-summarization model with all available information (Paragraph+OCR), achieved the best performance on all three metrics. Pegasus P +O+B , the model using the same information but trained on a better subset of captions (Paragraph+OCR-Better), did not perform as well. We hypothesized this was due to half of the test data comprising poor captions (see Section 8.2). This was validated by examining performance shifts in different quality beams (Section 8.1) and by conducting a human evaluation (Section 7.2). Meanwhile, Reuse M , the Reuse baseline with Mention, outperformed the other Reuse baselines; its performance declined as the context window grew and shifted.
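To make the normalization above concrete, the following is a minimal sketch assuming the random baseline is estimated with linear interpolation over (length, random score) pairs and the normalized score is expressed as the gain over that baseline; the numbers and the exact functional form are illustrative assumptions.

```python
import numpy as np

# (average length, random ROUGE-2 score) pairs, estimated by sampling random
# sentences/truncations from the input Paragraph (values here are made up).
lengths = np.array([4, 10, 20, 30, 60, 90, 150])
random_scores = np.array([0.010, 0.025, 0.040, 0.050, 0.055, 0.052, 0.048])

def random_baseline(length: float) -> float:
    """Random(length): linearly interpolated random score for a given length."""
    return float(np.interp(length, lengths, random_scores))

def normalized(score: float, avg_length: float) -> float:
    """Express a system's score relative to the random baseline of the same
    average length (shown here as a ratio, i.e., gain over the baseline)."""
    return score / random_baseline(avg_length)

print(normalized(score=0.12, avg_length=25))
```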

Human Evaluation Results
Pilot MTurk Study to Select Top Models. Before the main human evaluation, we ran a pilot study on Amazon Mechanical Turk (MTurk) to identify any clearly underperforming baselines to exclude from the final study, simplifying the main human evaluation. We asked MTurk workers to carefully read a figure and select the worst figure caption among (i) TrOCR, (ii) Pegasus P +O , (iii) Pegasus P +O+B , and (iv) the ground-truth caption. Ninety figures without errors were randomly sampled from our annotated set (i.e., figures from cs.CL arXiv papers in Section 8.2) for the study. For each figure, we recruited 20 MTurk workers as judges. (Four MTurk qualifications were used: Locale (US only), HIT Approval Rate ≥ 98%, Number of Approved HITs ≥ 3,000, and the Adult Content Qualification. Each task paid $0.09, targeting an hourly wage of $10.) We report the number of majority votes (when tied, we counted all captions with the highest votes as the worst) and the average number of votes in Table 3. Results indicated that TrOCR's caption won the majority vote 41 out of 90 times, with its average vote count significantly exceeding the others. Hence, we excluded TrOCR from our formal human evaluation.

Main Human Evaluation with Domain Experts.
Three Ph.D. students with NLP backgrounds (who are not coauthors) were recruited as human judges, as it is hard for those without basic domain understanding to evaluate captions. This study was approved by the IRB office of the authors' institute. The same 90 figures used in the pilot MTurk study were used again. We asked the human judges to compare each figure's (i) Pegasus P +O , (ii) Pegasus P +O+B , and (iii) ground-truth caption. The judges were asked to rank the captions based on how strongly they agreed with this statement: "When I read the paper, this caption can help me understand the message that the figure tries to convey." Figure 5 (see Appendix D) shows the interface the human judges used. Table 4 shows the results as average rankings (from 1 to 3). Overall, the ground-truth caption and Pegasus P +O+B were ranked similarly (1.919 vs. 1.930, p-value = 0.923). Humans also significantly favored Pegasus P +O+B over Pegasus P +O (1.919 vs. 2.152, p-value = 0.016). This supports our heuristic of automatically determining caption quality based on length and aligns with previous findings that longer captions improve reader comprehension (Hartley, 2003; Gelman et al., 2002). However, we found that the caption-ranking task poses a challenge, as evidenced by the low correlations between raters, with Kendall's tau values of 0.133, 0.148, and 0.274, and Spearman's rho values of 0.128, 0.156, and 0.317. This highlights the complexity of the task and suggests that scaling human evaluation across domains might be difficult. Different preferences over captions, such as length, could lead to lower agreement among raters.
Table 5: Results of the manual annotation. More than 50% of the captions were annotated as unhelpful. (Out of the initial 438 figure captions, we excluded those with extraction or classification errors, e.g., incomplete images, leaving us with only 399 captions.)

In-Depth Analysis
We conducted an in-depth investigation focused on two key challenges: (i) the common presence of low-quality author-written captions and (ii) the lack of clear standards for good captions.
Quality Annotation Procedure. We manually annotated 438 captions in the Computation and Language domain (cs.CL) from the test set. Figure 6 (see Appendix D) shows the interface we used, in which the title, abstract, and PDF file of the paper were shown alongside the target figure's image, caption, and questions. For each caption, we asked the annotators (coauthors) to rate four aspects using a five-point Likert scale:
• Image-Text. The caption included named entities or important words/numbers in the figure (e.g., title, legends, labels, etc.).
• Visual-Description. The caption included some visual characteristics of the figure (e.g., color, shape, trend, etc.).
• Takeaway. The caption explicitly stated the high-level takeaway message or the conclusion that the figure attempted to convey.
• Helpfulness. "The caption helped me understand the message that the figure attempted to convey."
The annotated data was consolidated by grouping "Strongly Agree" and "Agree" as "[Agree]" and grouping "Neutral", "Disagree", and "Strongly Disagree" as "[Disagree]". The results of this consolidation are presented in Table 5. We then evaluated models on different quality beams using the 399 annotated figure captions shown in Table 5. The captions were divided into the "helpful beam" (184 captions rated [Agree]) and the "unhelpful beam" (215 captions rated [Disagree]).
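A small sketch of this consolidation and beam split, assuming the annotations are stored in a pandas DataFrame with a Likert-label column (the column names and values are illustrative):

```python
import pandas as pd

ratings = pd.DataFrame({
    "figure_id": ["f1", "f2", "f3"],
    "helpfulness": ["Agree", "Neutral", "Strongly Agree"],  # five-point Likert labels
})

# Consolidate: {Strongly Agree, Agree} -> [Agree]; everything else -> [Disagree].
ratings["label"] = ratings["helpfulness"].map(
    lambda x: "[Agree]" if x in {"Strongly Agree", "Agree"} else "[Disagree]"
)

helpful_beam = ratings[ratings["label"] == "[Agree]"]       # 184 captions in our data
unhelpful_beam = ratings[ratings["label"] == "[Disagree]"]  # 215 captions in our data
print(len(helpful_beam), len(unhelpful_beam))
```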

Automatic Evaluation Over Beams of Different Quality
To validate the effect of low-quality captions, we re-performed the automatic evaluation on the helpful and unhelpful beam sets. Figure 3 shows the normalized ROUGE-2 and MoverScore scores for each model in the helpful and unhelpful beams. Most models performed better in the unhelpful beam, except Pegasus P +O+B , which had better scores in the helpful beam. Pegasus P +O+B was trained on captions with more than 30 tokens. This result suggests that improving training data quality, such as by using only longer captions, can positively impact the model's behavior and result in the generation of more helpful captions.
Human Evaluation Over Beams of Different Quality. We also re-evaluated human scores for both the helpful and unhelpful beams. The human evaluation in Section 7.2 only covered 90 figures, with 55 in the helpful beam and 35 in the unhelpful beam. Table 6 shows the results. On average, Pegasus P +O+B (1.867) was ranked better than author-written captions (2.019) in the unhelpful beam, in which machine-generated captions were preferred by human judges 22 out of 35 times. The results suggest that, with careful training data quality control, when author-written captions are not very helpful, machines could potentially generate better captions.

Challenge 2: What Constitutes a Good Figure Caption?
We calculated Pearson correlations (Rodgers and Nicewander, 1988) among the four aspects using raw five-point Likert ratings. The results are shown in Table 7. The highest correlation was found between Takeaway and Helpfulness, suggesting that a high-quality caption accurately captures the main message of the figure. There were also strong correlations between Helpfulness, Visual-Description, and Takeaway, indicating that a good caption effectively conveys visual information and summarizes the main message. However, Table 5 shows that only 16.04% and 18.55% of the captions described the visual characteristics and the takeaway message, respectively. A moderate correlation between Helpfulness and Length supports previous research findings that longer captions are generally more helpful for readers (Hartley, 2003; Gelman et al., 2002).
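Such correlations can be computed directly from the raw Likert ratings; a minimal sketch with SciPy follows, where the column names and values are toy examples:

```python
import pandas as pd
from scipy.stats import pearsonr

# Raw five-point Likert ratings per caption (toy values for illustration).
df = pd.DataFrame({
    "takeaway":    [1, 2, 4, 5, 3, 2],
    "helpfulness": [2, 2, 4, 5, 3, 1],
    "visual_desc": [1, 1, 3, 4, 2, 2],
})

for aspect in ["takeaway", "visual_desc"]:
    r, p = pearsonr(df[aspect], df["helpfulness"])
    print(f"{aspect} vs. helpfulness: r = {r:.3f} (p = {p:.3f})")
```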

Caption Length Distribution
Throughout this work's development, the length of captions emerged as a consistent issue. Despite existing literature indicating the benefits of longer captions for readers (Hartley, 2003; Gelman et al., 2002), space limitations often leave authors with no option but to craft shorter captions. To shed some light on this aspect and offer insight for future research, we analyzed the lengths of both author-created and machine-generated captions. We used Kernel Density Estimate (KDE) plots to investigate the distribution of caption lengths across different models and domains. As shown in Figure 4a, the majority of models exhibit a common peak at 10 tokens, while Pegasus P +O+B presents a significant deviation with a peak near 30 tokens. Figure 4b presents the distribution of helpfulness scores, derived from the quality annotation data (see Section 8.2). Captions rated with the maximum helpfulness score of 5 show a peak at 35 tokens, and caption length clearly shifts upward with higher scores. In Figure 4c, we examined the top 10 categories in the arXiv taxonomy. The figure suggests that a higher portion of the captions in cs, math, stat, and eess are shorter (around 10 tokens), while the remaining categories (cond-mat, quant-ph, q-bio, etc.) have higher probabilities of longer captions. However, within the cs domain (Figure 4d), the top 10 subcategories do not show significant differences in caption length distribution.
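A minimal sketch of plotting such a length distribution with seaborn is shown below; the column names and values are toy examples:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Caption lengths (in tokens) per model; toy data for illustration.
df = pd.DataFrame({
    "length": [9, 11, 10, 28, 32, 35, 12, 8],
    "model": ["Pegasus P+O"] * 3 + ["Pegasus P+O+B"] * 3 + ["Ground truth"] * 2,
})

# One KDE curve per model, each normalized separately.
sns.kdeplot(data=df, x="length", hue="model", common_norm=False)
plt.xlabel("Caption length (tokens)")
plt.show()
```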

Discussion
Is Text Really All You Need? Our results demonstrate that summarizing figure-mentioning paragraphs is sufficient to generate captions, as shown by the similar scores of Pegasus P and Pegasus P +O in Table 2; adding OCR had limited impact. Furthermore, in a recent study of scientific figure captioning by Yang et al. (2023), the best-performing model only considered figure-mentioning paragraphs and OCR tokens (note that their OCR tokens were visual features) without taking the figure's imagery into account. These results raise an interesting question: Do we need visual information at all? And for what? The token alignment study (Section 5) showed that 75.19% of the caption information could be found in the Paragraphs, meaning 24.81% of the information was missing. Understanding this missing information could help improve the models' performance. Thus, we calculated the correlation between the amount of missing information and three aspect ratings (Image-Text, Visual-Description, and Takeaway) in the quality annotation data (Section 8.2). The missing information was quantified as the number or percentage of caption tokens not aligned to any token in the figure-mentioning paragraphs. Table 8 shows a positive correlation between the extent of missing information and the Visual-Description and Takeaway ratings. This suggests that incorporating visual descriptions (e.g., "dashed line," "red line") is key to enhancing performance by filling the gaps not covered by the article's text. Furthermore, the strong correlation between Helpfulness and Visual-Description in Table 7 also indicates that including image information is necessary for writing good captions. It should be noted that OCR only captures text in the image (e.g., labels, legends) and not visual elements (e.g., "dashed line"). A promising future direction is developing a multimodal model that can effectively incorporate both image and text.
What is the Best Length for Captions? Our research indicates that filtering out shorter captions can facilitate the generation of more helpful captions. However, the resulting captions tend to be longer than usual, as shown by the rightward shift of Pegasus P +O+B in Figure 4a. This raises a question: Is it fair to compare short and long captions on usefulness, given that longer captions inherently contain more information? While our automatic evaluation addressed this by applying length normalization, our human evaluations and quality annotations did not specifically instruct the annotators to consider caption length. Nevertheless, we argue that even if we had asked annotators to consider caption length while identifying helpful captions, the "ideal" caption length would differ among annotators due to multiple factors. For example, as shown in Figure 4c, caption length distributions vary across domains. The low inter-rater agreement in our human evaluation (see Section 7.2) also suggests that personal preference could influence the ideal caption length (Lundgard and Satyanarayan, 2021). Moreover, the ideal length could also be dictated by the context: writers might favor shorter captions due to page constraints, while readers might prefer longer but more informative ones (Stokes and Hearst, 2022; Sun et al., 2019). To tackle this issue, a potential future direction could be enabling models to generate captions of diverse lengths to suit different users and contexts.

Conclusion and Future Work
This work presented a new perspective on automatic figure captioning, demonstrating that a language-based approach, i.e., summarizing figure-referring paragraphs, can outperform conventional vision-based methods. Our analysis further revealed many unhelpful captions in arXiv papers, highlighting the impact of data quality on captioning performance. This work lays the groundwork for further research, including exploring new data selection, revision, and augmentation strategies to mitigate the effects of low-quality data, developing new evaluation methods, and creating more robust models that better handle noisy data. We also aim to expand the technology's scope to cover a wider variety of figures and article types.

Limitations
Although our proposed methods have been shown to be effective, we are aware of several limitations. First, our approach requires mentions in order to produce captions, but it is not always easy to automatically identify the mentions of a given figure in real-world data. In the original SCICAP, 18.81% of figures did not have any identified mentions; we excluded these from this work. Many factors contributed to this gap, including errors caused by upstream components such as image extraction or image-type classification (e.g., tables), unexpected figure index formats (e.g., "Figure VIII", "Figure C·1", "Fig.Fig. 4(b)"), PDF parsing errors, or the figure never being mentioned in the paper. Second, our method uses text instead of images as the primary information source, so it naturally inherits all the constraints of text: it cannot capture any visual element of the figure that the text never mentions, and it struggles when the text is poorly written. Finally, this paper focused on non-compound line charts in arXiv papers, and the human evaluation only covered NLP papers. More research is needed to examine generalizability.

Ethics Statement
We consider the proposed technology to impose little risk to readers, as it only summarizes what has already been presented in the paper. However, when a generated caption contains inaccurate information, it could mislead readers. Furthermore, because the proposed technology inherently neglects visual content, it might have an impact on the accessibility of figure captions.

A Token Overlap Study
This is an additional study supporting Section 5. We computed n-gram precision scores (BLEU-4) between mentions and captions extracted under different settings. For mentions, we included (i) First Mention, the sentence that first mentions the figure in the paper; (ii) Random Mention, a randomly selected sentence among all mentions; and (iii) Random Sentence, a randomly selected sentence from the paper. We also included one or two following sentences, as the surrounding context may contain information relevant to the figure. For captions, we examined (i) First Caption, the first sentence of the caption, and (ii) Whole Caption, all the sentences in the caption. The results are shown in Table 10.
Table 10: N-gram matching (BLEU-4) between captions and mentions of each figure. First and Whole refer to the first sentence of the caption and the whole caption, respectively. Context means the number of following sentences included. First Mention was better than Random Mention in the corresponding settings, suggesting that writers may give more details when first introducing the figure.
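A minimal sketch of this sentence-level BLEU-4 computation with NLTK is shown below; here the caption is treated as the hypothesis and the mention text as the single reference, and the smoothing choice and example texts are illustrative assumptions.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

caption = "accuracy of the model on the test set as training data grows".split()
mention = "figure 3 shows the accuracy on the test set as the training data grows".split()

# BLEU-4: uniform weights over 1- to 4-grams; smoothing avoids zero scores
# when a higher-order n-gram has no match in short texts.
smooth = SmoothingFunction().method1
score = sentence_bleu([mention], caption, weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=smooth)
print(f"BLEU-4: {score:.3f}")
```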

B Data Preprocessing Details
We describe the detailed data preprocessing steps here as supplementary material for Section 4. To feed the OCR texts into our models, we concatenated them with the sequence of coordinates obtained by traversing the bounding boxes from left to right and then from top to bottom.
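A minimal sketch of this traversal, assuming each OCR token carries the top-left (x, y) coordinate of its bounding box and that tokens are concatenated in reading order:

```python
from typing import Dict, List

def concat_ocr(tokens: List[Dict]) -> str:
    """Concatenate OCR tokens in reading order: lines from top to bottom,
    and left to right within a line, based on each bounding box's origin."""
    ordered = sorted(tokens, key=lambda t: (t["y"], t["x"]))
    return " ".join(t["text"] for t in ordered)

ocr_tokens = [
    {"text": "Accuracy", "x": 5,  "y": 10},
    {"text": "Epochs",   "x": 60, "y": 200},
    {"text": "0.9",      "x": 2,  "y": 40},
]
print(concat_ocr(ocr_tokens))  # "Accuracy 0.9 Epochs"
```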
Representativeness. It is worth noting that we manually verified 399 figures (the set used in Section 8) and found that 81.2% (324/399) were published at academic conferences, and 51.9% (207/399) appeared in the ACL Anthology, IEEE, or ACM venues, suggesting that the data is representative.

C Training and Decoding Details
We describe the model training details and the decoding configuration used in Section 7.
Training Details for Text Summarization Models. We fine-tuned Pegasus for the text-summarization task using HuggingFace's implementation (Wolf et al., 2020). All the models shared the same training hyperparameters except the maximum text length, as the data varies across the examined settings. The maximum source length and target length were set to (i) fully cover at least 95% of the text without truncation and (ii) still fit into the machine. We show the length configuration in Table 9. Other hyperparameters used for training were batch size = 32, learning rate = 5e-5 with a linear decay scheduler, and number of training epochs = 200. We evaluated the model every five epochs and kept the checkpoint with the highest ROUGE-2 score for testing (Liu and Liu, 2021; Zhong et al., 2020). All models were trained with an NVIDIA A100 GPU; each model took one to three days to train.
Training Details for Vision-to-Language Models. Two vision-to-language models were fine-tuned using HuggingFace: (i) a sequence-to-sequence model using BEiT (microsoft/beit-large-patch16-384) and GPT-2 (gpt2-large), and (ii) TrOCR (microsoft/trocr-large-printed). The hyperparameters used for training were maximum target length = 100 and learning rate = 2e-5 with a linear warmup (one epoch) and a linear decay scheduler. Batch sizes were 32 and 64, respectively. The models were trained using AdamW (Loshchilov and Hutter, 2019) with weight decay = 1e-4 for 100 epochs. We evaluated the models every epoch and kept the checkpoint with the highest ROUGE-2 score (Liu and Liu, 2021; Zhong et al., 2020). Each model was trained with an NVIDIA A100 GPU for two days.
Decoding. For all generation models, captions were decoded using the beam sampling strategy, with beam size = 5, temperature = 0.8, top-k = 100, repetition penalty = 3.0, minimum length = 10, and maximum length = 100.
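A sketch of this decoding configuration using the HuggingFace generate API is shown below; the checkpoint name is a placeholder for the fine-tuned Pegasus model, and the input text is made up.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Placeholder checkpoint; substitute the fine-tuned Pegasus model.
tokenizer = AutoTokenizer.from_pretrained("google/pegasus-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/pegasus-large")

inputs = tokenizer("Figure 3 shows that accuracy drops sharply as noise increases.",
                   return_tensors="pt", truncation=True)
outputs = model.generate(
    **inputs,
    num_beams=5,            # beam sampling with beam size 5
    do_sample=True,
    temperature=0.8,
    top_k=100,
    repetition_penalty=3.0,
    min_length=10,
    max_length=100,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```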

D Interfaces
Figure 5 shows the interface the human judges used to rank the captions (see Section 7.2). The paper's title (without a link to the paper's URL) and abstract are shown. The human judges can drag the captions (each displayed with the figure) from the left pane and drop them into the right pane to rank them. The initial display order of the captions is randomized. We did not display the paper's PDF or a link to the paper's URL, to prevent the human judges from being biased toward the author-written captions.
Figure 6 shows the interface we used to rate the usefulness of captions (see Section 8.2). The title (with a hyperlink to the paper's URL), abstract, and PDF file of the paper were shown, alongside the target figure's image and caption and all the questions. Here we displayed the paper's PDF to help raters make more informed decisions on caption quality.

E Additional Experimental Results
In this section, we show all the additional experimental results mentioned in the experiments and analyses.
Normalization Scores. Figures 8 to 11 show the relationship between generated text length and performance (ROUGE-1, ROUGE-L, MoverScore, and BERTScore). The Random lines indicate that text length and performance are not independent, suggesting that normalization over text length is needed. Table 11 shows the corresponding random scores for each of the metrics used in Table 2.
Examples. Figure 7 shows two samples of generation output. The information generated by Pegasus P +O+B could be helpful (A), but it could also introduce factual errors (B).
Performance in Different Quality Beams. Figure 12 shows the ROUGE-1 and ROUGE-L changes in beams of different quality. The findings are similar to Section 8.1: among the generation models, only the one trained with data-quality control (i.e., Pegasus P +O+B ) performed better in the helpful beam, generating captions more similar to the helpful ones.

Table 11: Random scores corresponding to each text length for each automatic evaluation metric (ROUGE-1/2/L F1, MoverScore, and BERTScore).
Figure 8: The relationship between average text length and ROUGE-1. When the generated text is shorter than 50 tokens, longer texts generally result in a higher ROUGE-1 score.
Figure 9: The relationship between average text length and ROUGE-L. When the generated text is shorter than 50 tokens, longer texts generally result in a higher ROUGE-L score.
Figure 10: The relationship between average text length and MoverScore. When the generated text is shorter than 30 tokens, longer texts generally result in a higher MoverScore.
Figure 11: The relationship between average text length and BERTScore. When the generated text is shorter than 30 tokens, longer texts generally result in a higher BERTScore.
Figure 12: Normalized ROUGE-1, ROUGE-L, and BERTScore for beams of different quality. Most of the generative models (Pegasus, BEiT+GPT-2, and TrOCR) performed better in the unhelpful beam, suggesting that they may be better at generating bad captions. Only the model trained with better captions (Pegasus P +O+B ) learned to generate good captions, showing a much better score in the helpful beam. Note that although Pegasus O also performs better in the helpful beam, the difference is subtle.