Multimodal Chart Retrieval: A Comparison of Text, Table and Image Based Approaches

Averi Nowak, Francesco Piccinno, Yasemin Altun


Abstract
We investigate multimodal chart retrieval, addressing the challenge of retrieving image-based charts using textual queries. We compare four approaches: (a) OCR with text retrieval, (b) chart derendering (DePlot) followed by table retrieval, (c) a direct image understanding model (PaLI-3), and (d) a combined PaLI-3 + DePlot approach. As the table retrieval component, we introduce Tab-GTR, a text retrieval model augmented with table structure embeddings, which achieves state-of-the-art results on the NQ-Tables benchmark with 48.88% R@1. On in-distribution data, the DePlot-based method (b) outperforms PaLI-3 (c) while being significantly more efficient (300M vs. 3B trainable parameters). However, DePlot struggles with complex charts, indicating a need for improvements in chart derendering, specifically in chart data diversity and the richness of text/table representations. Overall, we find no clear winner between methods (b) and (c); the best performance is achieved by the combined approach (d), which we further show benefits the most from multi-task training.
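The R@1 figure quoted in the abstract is a recall-at-k metric: the fraction of queries whose correct item is ranked among the top k retrieved candidates. The following is a minimal sketch of how such a score could be computed, assuming each query and each chart has already been mapped to a fixed-size embedding and that relevance is scored by dot product. The abstract does not specify the scoring function, and the names `dot`, `recall_at_k`, and the toy vectors are hypothetical, chosen only for illustration.

```python
from typing import Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot-product similarity between two embeddings (an assumed scoring choice)."""
    return sum(a * b for a, b in zip(u, v))


def recall_at_k(
    query_vecs: Sequence[Sequence[float]],
    chart_vecs: Sequence[Sequence[float]],
    gold_indices: Sequence[int],
    k: int = 1,
) -> float:
    """Fraction of queries whose gold chart appears among the top-k scored charts."""
    hits = 0
    for query, gold in zip(query_vecs, gold_indices):
        scores = [dot(query, chart) for chart in chart_vecs]
        # Rank chart indices from highest to lowest score.
        ranked = sorted(range(len(chart_vecs)), key=lambda i: scores[i], reverse=True)
        if gold in ranked[:k]:
            hits += 1
    return hits / len(query_vecs)


# Toy, made-up embeddings: three charts and three queries in two dimensions.
toy_charts = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
toy_queries = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]
toy_gold = [0, 1, 2]  # index of the correct chart for each query

r_at_1 = recall_at_k(toy_queries, toy_charts, toy_gold, k=1)
r_at_2 = recall_at_k(toy_queries, toy_charts, toy_gold, k=2)
```

On this toy data the third query ranks its gold chart second, so it counts as a miss at k=1 and a hit at k=2. In any of the four compared approaches, the embeddings would come from the corresponding encoder (text from OCR, a derendered table, or the chart image), while the metric itself stays the same.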
Anthology ID:
2024.naacl-long.307
Volume:
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Kevin Duh, Helena Gomez, Steven Bethard
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
5488–5505
URL:
https://aclanthology.org/2024.naacl-long.307
Cite (ACL):
Averi Nowak, Francesco Piccinno, and Yasemin Altun. 2024. Multimodal Chart Retrieval: A Comparison of Text, Table and Image Based Approaches. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 5488–5505, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
Multimodal Chart Retrieval: A Comparison of Text, Table and Image Based Approaches (Nowak et al., NAACL 2024)
PDF:
https://aclanthology.org/2024.naacl-long.307.pdf
Copyright:
 2024.naacl-long.307.copyright.pdf