@inproceedings{dong-etal-2024-encoding,
title = "Encoding Spreadsheets for Large Language Models",
author = "Dong, Haoyu and
Zhao, Jianbo and
Tian, Yuzhang and
Xiong, Junyu and
Zhou, Mengyu and
Lin, Yun and
Cambronero, Jos{\'e} and
He, Yeye and
Han, Shi and
Zhang, Dongmei",
editor = "Al-Onaizan, Yaser and
Bansal, Mohit and
Chen, Yun-Nung",
booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2024",
address = "Miami, Florida, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.emnlp-main.1154",
pages = "20728--20748",
abstract = "Spreadsheets are characterized by their extensive two-dimensional grids, flexible layouts, and varied formatting options, which pose significant challenges for large language models (LLMs). In response, we introduce SheetEncoder, pioneering an efficient encoding method designed to unleash and optimize LLMs{'} powerful understanding and reasoning capability on spreadsheets. Initially, we propose a vanilla serialization approach that incorporates cell addresses, values, and formats. However, this approach was limited by LLMs{'} token constraints, making it impractical for most applications. To tackle this challenge, three innovative modules are proposed to compress spreadsheets effectively: structural-anchor-based compression, inverse index translation, and data-format-aware aggregation. It significantly improves performance in spreadsheet table detection task, outperforming the vanilla approach by 25.6{\%} in GPT4{'}s in-context learning setting. Moreover, fine-tuned LLM with SheetEncoder has an average compression ratio of 25{\mbox{$\times$}}, but achieves a state-of-the-art 78.9{\%} F1 score, surpassing the best existing models by 12.3{\%}, demonstrating that SheetEncoder greatly boosts LLMs{'}s performance on spreadsheet data.",
}
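
As a rough illustration of the "vanilla serialization" described in the abstract (each cell rendered as its address, value, and number format before any compression is applied), here is a minimal Python sketch. The function name, delimiters, and example format strings are assumptions made for illustration; they are not taken from the paper.

```python
# Minimal sketch of a vanilla spreadsheet serialization: every cell is
# flattened to "address,value,format" and cells are joined with "|".
# All naming and markup choices here are illustrative assumptions,
# not SheetEncoder's actual encoding.

def serialize_sheet(cells):
    """cells: dict mapping A1-style addresses to (value, number_format) tuples."""
    parts = []
    for address, (value, number_format) in sorted(cells.items()):
        parts.append(f"{address},{value},{number_format}")
    return "|".join(parts)

example = {
    "A1": ("Year", "General"),
    "A2": (2024, "0"),
    "B1": ("Revenue", "General"),
    "B2": (1234.5, "#,##0.00"),
}
print(serialize_sheet(example))
# A1,Year,General|A2,2024,0|B1,Revenue,General|B2,1234.5,#,##0.00
```

Note that the token cost of such an encoding grows with every non-empty cell, which is the limitation the paper's three compression modules (structural-anchor-based compression, inverse index translation, and data-format-aware aggregation) are designed to address.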
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="dong-etal-2024-encoding">
<titleInfo>
<title>Encoding Spreadsheets for Large Language Models</title>
</titleInfo>
<name type="personal">
<namePart type="given">Haoyu</namePart>
<namePart type="family">Dong</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Jianbo</namePart>
<namePart type="family">Zhao</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yuzhang</namePart>
<namePart type="family">Tian</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Junyu</namePart>
<namePart type="family">Xiong</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Mengyu</namePart>
<namePart type="family">Zhou</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yun</namePart>
<namePart type="family">Lin</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">José</namePart>
<namePart type="family">Cambronero</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yeye</namePart>
<namePart type="family">He</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Shi</namePart>
<namePart type="family">Han</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Dongmei</namePart>
<namePart type="family">Zhang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2024-11</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing</title>
</titleInfo>
<name type="personal">
<namePart type="given">Yaser</namePart>
<namePart type="family">Al-Onaizan</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Mohit</namePart>
<namePart type="family">Bansal</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yun-Nung</namePart>
<namePart type="family">Chen</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Miami, Florida, USA</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
</relatedItem>
<abstract>Spreadsheets are characterized by their extensive two-dimensional grids, flexible layouts, and varied formatting options, which pose significant challenges for large language models (LLMs). In response, we introduce SheetEncoder, pioneering an efficient encoding method designed to unleash and optimize LLMs’ powerful understanding and reasoning capability on spreadsheets. Initially, we propose a vanilla serialization approach that incorporates cell addresses, values, and formats. However, this approach was limited by LLMs’ token constraints, making it impractical for most applications. To tackle this challenge, three innovative modules are proposed to compress spreadsheets effectively: structural-anchor-based compression, inverse index translation, and data-format-aware aggregation. It significantly improves performance in spreadsheet table detection task, outperforming the vanilla approach by 25.6% in GPT4’s in-context learning setting. Moreover, fine-tuned LLM with SheetEncoder has an average compression ratio of 25\times, but achieves a state-of-the-art 78.9% F1 score, surpassing the best existing models by 12.3%, demonstrating that SheetEncoder greatly boosts LLMs’s performance on spreadsheet data.</abstract>
<identifier type="citekey">dong-etal-2024-encoding</identifier>
<location>
<url>https://aclanthology.org/2024.emnlp-main.1154</url>
</location>
<part>
<date>2024-11</date>
<extent unit="page">
<start>20728</start>
<end>20748</end>
</extent>
</part>
</mods>
</modsCollection>