Word-Level Alignment of Paper Documents with their Electronic Full-Text Counterparts

Mark-Christoph Müller, Sucheta Ghosh, Ulrike Wittig, Maja Rey


Abstract
We describe a simple procedure for the automatic creation of word-level alignments between printed documents and their respective full-text versions. The procedure is unsupervised, uses standard, off-the-shelf components only, and reaches an F-score of 85.01 in the basic setup and up to 86.63 when using pre- and post-processing. Potential areas of application are manual database curation (incl. document triage) and biomedical expression OCR.
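The abstract does not spell out how the alignment is computed or scored, so the following is only an illustrative sketch and not the authors' pipeline. It assumes both documents are already tokenized into word lists, uses Python's standard `difflib.SequenceMatcher` as a stand-in aligner, and scores the result with an F-score over index pairs. The function names `align_words` and `f_score`, and the example sentences, are hypothetical.

```python
from difflib import SequenceMatcher


def align_words(ocr_tokens, text_tokens):
    """Return (ocr_index, text_index) pairs for tokens that match exactly.

    Assumption: exact string equality stands in for whatever matching
    criterion a real system would use.
    """
    matcher = SequenceMatcher(a=ocr_tokens, b=text_tokens, autojunk=False)
    pairs = []
    for block in matcher.get_matching_blocks():
        # Each block covers `size` consecutive tokens that agree in both lists.
        for k in range(block.size):
            pairs.append((block.a + k, block.b + k))
    return pairs


def f_score(predicted, gold):
    """Harmonic mean of precision and recall over alignment pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical OCR output with two misrecognized words.
ocr = "The enzyrne was purified frorn liver".split()
full_text = "The enzyme was purified from liver".split()

alignment = align_words(ocr, full_text)
gold = [(i, i) for i in range(len(full_text))]
print(alignment)
print(f_score(alignment, gold))
```

Because this sketch matches tokens exactly, the two misrecognized words stay unaligned and lower recall; a practical aligner would presumably need some tolerance for OCR errors.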
Anthology ID:
2021.bionlp-1.19
Volume:
Proceedings of the 20th Workshop on Biomedical Language Processing
Month:
June
Year:
2021
Address:
Online
Venues:
BioNLP | NAACL
SIG:
SIGBIOMED
Publisher:
Association for Computational Linguistics
Pages:
168–179
URL:
https://aclanthology.org/2021.bionlp-1.19
DOI:
10.18653/v1/2021.bionlp-1.19
Cite (ACL):
Mark-Christoph Müller, Sucheta Ghosh, Ulrike Wittig, and Maja Rey. 2021. Word-Level Alignment of Paper Documents with their Electronic Full-Text Counterparts. In Proceedings of the 20th Workshop on Biomedical Language Processing, pages 168–179, Online. Association for Computational Linguistics.
Cite (Informal):
Word-Level Alignment of Paper Documents with their Electronic Full-Text Counterparts (Müller et al., BioNLP 2021)
PDF:
https://aclanthology.org/2021.bionlp-1.19.pdf
Code:
nlpAThits/BioNLP2021