LogogramNLP: Comparing Visual and Textual Representations of Ancient Logographic Writing Systems for NLP

Danlu Chen, Freda Shi, Aditi Agarwal, Jacobo Myerston, Taylor Berg-Kirkpatrick


Abstract
Standard natural language processing (NLP) pipelines operate on symbolic representations of language, which typically consist of sequences of discrete tokens. However, creating an analogous representation for ancient logographic writing systems is an extremely labor-intensive process that requires expert knowledge. At present, a large portion of logographic data persists in a purely visual form due to the absence of transcription. This poses a bottleneck for researchers seeking to apply NLP toolkits to study ancient logographic languages: most of the relevant data are images of writing. This paper investigates whether direct processing of visual representations of language offers a potential solution. We introduce LogogramNLP, the first benchmark enabling NLP analysis of ancient logographic languages, featuring both transcribed and visual datasets for four writing systems along with annotations for tasks like classification, translation, and parsing. Our experiments compare systems that employ recent visual and text encoding strategies as backbones. The results demonstrate that visual representations outperform textual representations for some investigated tasks, suggesting that visual processing pipelines may unlock a large amount of cultural heritage data of logographic languages for NLP-based analyses. Data and code are available at https://logogramNLP.github.io/.
Anthology ID:
2024.acl-long.768
Volume:
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
14238–14254
URL:
https://aclanthology.org/2024.acl-long.768/
DOI:
10.18653/v1/2024.acl-long.768
Cite (ACL):
Danlu Chen, Freda Shi, Aditi Agarwal, Jacobo Myerston, and Taylor Berg-Kirkpatrick. 2024. LogogramNLP: Comparing Visual and Textual Representations of Ancient Logographic Writing Systems for NLP. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14238–14254, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
LogogramNLP: Comparing Visual and Textual Representations of Ancient Logographic Writing Systems for NLP (Chen et al., ACL 2024)
PDF:
https://aclanthology.org/2024.acl-long.768.pdf