GlotScript: A Resource and Tool for Low Resource Writing System Identification

Amir Hossein Kargaran; François Yvon; Hinrich Schütze

Abstract

We present GlotScript, an open resource and tool for low resource writing system identification. GlotScript-R is a resource that provides the attested writing systems for more than 7,000 languages. It is compiled by aggregating information from existing writing system resources. GlotScript-T is a writing system identification tool that covers all 161 Unicode 15.0 scripts. For an input text, it returns the text's script distribution, with scripts identified by ISO 15924 codes. We also present two use cases for GlotScript. First, we demonstrate that GlotScript can help clean multilingual corpora such as mC4 and OSCAR. Second, we use GlotScript to analyze the tokenization of a number of language models, such as GPT-4, and provide insights on each model's coverage of low resource scripts and languages. We hope that GlotScript will become a useful resource for work on low resource languages in the NLP community. GlotScript-R and GlotScript-T are available at https://github.com/cisnlp/GlotScript.
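
To illustrate what a "script distribution" keyed by ISO 15924 codes might look like, here is a minimal Python sketch. It is not GlotScript-T's implementation or API: the function name `script_distribution` and the `SCRIPT_RANGES` table are hypothetical, and the table covers only a handful of approximate code point ranges for five scripts, whereas the tool covers all 161 Unicode 15.0 scripts.

```python
from collections import Counter

# Hypothetical, heavily truncated table: (first code point, last code point,
# ISO 15924 code). The ranges are approximate and for illustration only.
SCRIPT_RANGES = [
    (0x0041, 0x005A, "Latn"),  # A-Z
    (0x0061, 0x007A, "Latn"),  # a-z
    (0x00C0, 0x00D6, "Latn"),  # Latin-1 letters, skipping the multiplication sign
    (0x00D8, 0x00F6, "Latn"),  # Latin-1 letters, skipping the division sign
    (0x00F8, 0x024F, "Latn"),  # rest of Latin-1 plus Latin Extended-A/B
    (0x0391, 0x03A9, "Grek"),  # Greek capital letters
    (0x03B1, 0x03C9, "Grek"),  # Greek small letters
    (0x0400, 0x0482, "Cyrl"),  # core Cyrillic letters
    (0x0621, 0x063A, "Arab"),  # Arabic letters before the tatweel
    (0x0641, 0x064A, "Arab"),  # Arabic letters after the tatweel
    (0x4E00, 0x9FFF, "Hani"),  # CJK Unified Ideographs block
]


def script_distribution(text):
    """Return the share of each script among the characters the table knows.

    Characters outside the table (digits, spaces, punctuation, and every
    script not listed above) are ignored in this sketch.
    """
    counts = Counter()
    for ch in text:
        cp = ord(ch)
        for start, end, code in SCRIPT_RANGES:
            if start <= cp <= end:
                counts[code] += 1
                break
    total = sum(counts.values())
    if total == 0:
        return {}
    return {code: n / total for code, n in counts.items()}
```

A distribution like this could support the corpus-cleaning use case described above: a line whose dominant script is not among those attested for its labeled language would be a candidate for filtering. The threshold and filtering policy are left open here.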

Anthology ID: 2024.lrec-main.687
Volume: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month: May
Year: 2024
Address: Torino, Italia
Editors: Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues: LREC | COLING
Publisher: ELRA and ICCL
Pages: 7774–7784
URL: https://aclanthology.org/2024.lrec-main.687/
Cite (ACL): Amir Hossein Kargaran, François Yvon, and Hinrich Schütze. 2024. GlotScript: A Resource and Tool for Low Resource Writing System Identification. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 7774–7784, Torino, Italia. ELRA and ICCL.
Cite (Informal): GlotScript: A Resource and Tool for Low Resource Writing System Identification (Kargaran et al., LREC-COLING 2024)
PDF: https://aclanthology.org/2024.lrec-main.687.pdf
