Wenbin Lai
2024
PDF-to-Tree: Parsing PDF Text Blocks into a Tree
Yue Zhang | Zhihao Zhang | Wenbin Lai | Chong Zhang | Tao Gui | Qi Zhang | Xuanjing Huang
Findings of the Association for Computational Linguistics: EMNLP 2024
In many PDF documents, the reading order of text blocks is missing, which can hinder machine understanding of the document’s content. Existing works try to extract one universal reading order for a PDF file. However, applications like Retrieval Augmented Generation (RAG) require breaking long articles into sections and subsections for better indexing. For this reason, this paper introduces a new task and dataset, PDF-to-Tree, which organizes the text blocks of a PDF into a tree structure. Since a PDF may contain thousands of text blocks, far exceeding the number of words in a sentence, this paper proposes a transition-based parser that uses a greedy strategy to build the tree structure. Compared to parsers for plain text, we also use multi-modal features to encode the parser state. Experiments show that our approach achieves an accuracy of 93.93%, surpassing the performance of baseline methods by 6.72%.
2023
Characterizing the Impacts of Instances on Robustness
Rui Zheng | Zhiheng Xi | Qin Liu | Wenbin Lai | Tao Gui | Qi Zhang | Xuanjing Huang | Jin Ma | Ying Shan | Weifeng Ge
Findings of the Association for Computational Linguistics: ACL 2023
Building robust deep neural networks (DNNs) against adversarial attacks is an important but challenging task. Previous defense approaches mainly focus on developing new model structures or training algorithms, but they do little to tap the potential of training instances, especially instances with robust patterns carrying innate robustness. In this paper, we show that robust and non-robust instances in the training dataset, though both important for test performance, have contrary impacts on robustness, which makes it possible to build a highly robust model by leveraging the training dataset in a more effective way. We propose a new method that can distinguish robust instances from non-robust ones according to the model’s sensitivity to perturbations on individual instances during training. Surprisingly, we find that the model under standard training easily overfits the robust instances by relying on their simple patterns before the model completely learns their robust features. Finally, we propose a new mitigation algorithm to further release the potential of robust instances. Experimental results show that proper use of robust instances in the original dataset is a new avenue for achieving highly robust models.