TableBank: Table Benchmark for Image-based Table Detection and Recognition

Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou, Zhoujun Li


Abstract
We present TableBank, a new image-based table detection and recognition dataset built with novel weak supervision from Word and LaTeX documents on the internet. Existing research on image-based table detection and recognition usually fine-tunes pre-trained models on out-of-domain data with a few thousand human-labeled examples, an approach that generalizes poorly to real-world applications. Using TableBank, which contains 417K high-quality labeled tables, we build several strong baselines with state-of-the-art deep neural network models. We make TableBank publicly available and hope it will enable more deep learning approaches to table detection and recognition. The dataset and models can be downloaded from https://github.com/doc-analysis/TableBank.
Anthology ID:
2020.lrec-1.236
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
1918–1925
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.236
Cite (ACL):
Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou, and Zhoujun Li. 2020. TableBank: Table Benchmark for Image-based Table Detection and Recognition. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 1918–1925, Marseille, France. European Language Resources Association.
Cite (Informal):
TableBank: Table Benchmark for Image-based Table Detection and Recognition (Li et al., LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.236.pdf
Code:
doc-analysis/TableBank