PDF-to-Tree: Parsing PDF Text Blocks into a Tree

Yue Zhang, Zhihao Zhang, Wenbin Lai, Chong Zhang, Tao Gui, Qi Zhang, Xuanjing Huang


Abstract
In many PDF documents, the reading order of text blocks is missing, which can hinder machine understanding of the document’s content. Existing works try to extract a single universal reading order for a PDF file. However, applications such as Retrieval-Augmented Generation (RAG) require breaking long articles into sections and subsections for better indexing. For this reason, this paper introduces a new task and dataset, PDF-to-Tree, which organizes the text blocks of a PDF into a tree structure. Since a PDF may contain thousands of text blocks, far exceeding the number of words in a sentence, this paper proposes a transition-based parser that uses a greedy strategy to build the tree structure. Unlike parsers for plain text, our parser also uses multi-modal features to encode the parser state. Experiments show that our approach achieves an accuracy of 93.93%, outperforming baseline methods by 6.72%.
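
To make the idea of greedy, transition-based tree building concrete, the following is a minimal sketch and not the authors' implementation. It assumes each text block arrives in sequence and that a scoring function decides which open ancestor the block attaches to. The names `Node`, `toy_score`, and `greedy_parse` are hypothetical, and the font-size heuristic merely stands in for a learned scorer over multi-modal features.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    text: str
    font_size: float  # hypothetical stand-in for richer multi-modal features
    children: List["Node"] = field(default_factory=list)


def toy_score(parent: Node, block: Node) -> float:
    """Hypothetical scorer: a parent is plausible if its font is larger."""
    return 1.0 if parent.font_size > block.font_size else 0.0


def greedy_parse(
    blocks: List[Node],
    score: Callable[[Node, Node], float] = toy_score,
) -> Node:
    """Attach each block to the best-scoring open ancestor, one pass, no backtracking."""
    root = Node("<root>", float("inf"))
    stack = [root]  # path from the root to the most recently attached block
    for block in blocks:
        # Candidate parents are the nodes on the stack; ties go to the deepest one.
        best = max(range(len(stack)), key=lambda d: (score(stack[d], block), d))
        del stack[best + 1:]              # close everything below the chosen parent
        stack[-1].children.append(block)  # attach the block to that parent
        stack.append(block)               # the block becomes the new open leaf
    return root


blocks = [
    Node("Title", 20), Node("Section 1", 14), Node("Paragraph A", 10),
    Node("Paragraph B", 10), Node("Section 2", 14), Node("Paragraph C", 10),
]
tree = greedy_parse(blocks)
```

Because every block is committed to a parent as soon as it is read, the work per block is bounded by the current tree depth rather than by the total number of blocks, which is the practical appeal of a greedy strategy when a document contains thousands of blocks.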
Anthology ID:
2024.findings-emnlp.628
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2024
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
10704–10714
URL:
https://aclanthology.org/2024.findings-emnlp.628
Cite (ACL):
Yue Zhang, Zhihao Zhang, Wenbin Lai, Chong Zhang, Tao Gui, Qi Zhang, and Xuanjing Huang. 2024. PDF-to-Tree: Parsing PDF Text Blocks into a Tree. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 10704–10714, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
PDF-to-Tree: Parsing PDF Text Blocks into a Tree (Zhang et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-emnlp.628.pdf