Dataset Construction for Scientific-Document Writing Support by Extracting Related Work Section and Citations from PDF Papers

Keita Kobayashi, Kohei Koyama, Hiromi Narimatsu, Yasuhiro Minami


Abstract
To augment datasets used for scientific-document writing support research, we extract texts from “Related Work” sections and citation information in PDF-formatted papers published in English. The previous dataset was constructed entirely with TeX-formatted papers, from which it is easy to extract citation information. However, since many publicly available papers in various fields are provided only in PDF format, a dataset constructed using only TeX papers has limited utility. To resolve this problem, we augment the existing dataset by extracting the titles of sections using the visual features of PDF documents and extracting the Related Work section text using the explicit title information. Since text generated from the figures and footnotes appearing in the extraction target areas is considered noise, we remove instances of such text. Moreover, we map the cited paper’s information obtained using existing tools to citation marks detected by regular expression rules, resulting in pairs of cited paper information and text of the Related Work section. By evaluating body text extraction and citation mapping in the constructed dataset, the accuracy of the proposed dataset was found to be close to that of the previous dataset. Accordingly, we demonstrated the possibility of building a significantly augmented dataset.
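As an illustration of the citation-mapping step described in the abstract, the following is a minimal sketch of detecting citation marks with regular expressions and pairing them with reference entries. The abstract does not give the actual rules, so the two patterns, the function names (find_numeric_marks, find_author_year_marks, map_marks), and the dictionary-based reference list are all assumptions made for this example, not the authors' implementation.

```python
import re

# Hypothetical patterns for two common citation-mark styles.
# The paper's actual regular expression rules are not stated in the abstract.
NUMERIC = re.compile(r"\[(\d+(?:\s*,\s*\d+)*)\]")                 # e.g. [1] or [1, 3]
AUTHOR_YEAR = re.compile(r"\(([A-Z][^()\d]*?),?\s+(\d{4})[a-z]?\)")  # e.g. (Smith et al., 2020)


def find_numeric_marks(text):
    """Return the reference numbers inside each bracketed mark, in order."""
    return [[int(n) for n in m.group(1).split(",")]
            for m in NUMERIC.finditer(text)]


def find_author_year_marks(text):
    """Return (author string, year) tuples for each parenthesised mark."""
    return [(m.group(1), m.group(2)) for m in AUTHOR_YEAR.finditer(text)]


def map_marks(text, references):
    """Pair each numeric mark with an entry of `references`.

    `references` is an assumed dict from reference number to cited-paper
    information (here just a title); unknown numbers map to None.
    """
    return [(n, references.get(n))
            for nums in find_numeric_marks(text)
            for n in nums]


if __name__ == "__main__":
    sample = "Prior work [1, 3] relied on TeX sources (Smith et al., 2020)."
    refs = {1: "Paper A", 3: "Paper C"}
    print(map_marks(sample, refs))
    print(find_author_year_marks(sample))
```

A real pipeline would need further rules for ranges such as [2-5], multiple author-year citations inside one pair of parentheses, and marks broken across lines by PDF text extraction.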
Anthology ID:
2022.lrec-1.609
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
5673–5682
URL:
https://aclanthology.org/2022.lrec-1.609
Cite (ACL):
Keita Kobayashi, Kohei Koyama, Hiromi Narimatsu, and Yasuhiro Minami. 2022. Dataset Construction for Scientific-Document Writing Support by Extracting Related Work Section and Citations from PDF Papers. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 5673–5682, Marseille, France. European Language Resources Association.
Cite (Informal):
Dataset Construction for Scientific-Document Writing Support by Extracting Related Work Section and Citations from PDF Papers (Kobayashi et al., LREC 2022)
PDF:
https://aclanthology.org/2022.lrec-1.609.pdf