Consonant is all you need: a compact representation of English text for efficient NLP

Maged Al-shaibani, Irfan Ahmad


Abstract
In natural language processing (NLP), the representation of text plays a crucial role in tasks such as language modeling, sentiment analysis, and machine translation. The standard approach is to represent text the same way humans read and write it. In this paper, we propose representing text with only consonants, which yields a compact representation of English text that improves efficiency without sacrificing performance. We exploit the fact that consonants are more discriminative than vowels: by representing text using consonants, we can significantly reduce the memory and compute footprint required for storing and processing textual data. We present two alternative representations: ‘consonants-only’, where we remove the vowels from the text entirely, and ‘masked-vowels’, where we map all vowels to a single special symbol. To evaluate our approaches, we conducted experiments on several NLP tasks, including text classification, part-of-speech (POS) tagging, named-entity recognition (NER), and neural machine translation (NMT), in addition to language modeling. Our results demonstrate that the proposed consonant-based representation achieves performance comparable to the standard text representation while requiring significantly fewer computational resources. Furthermore, we show that our representation can be seamlessly integrated with existing NLP models and frameworks, providing a practical solution for efficient text processing. Finally, because some NLP applications need to reproduce text in human-readable form, we present a technique to retrieve the vowel information from our processed text representation.
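The two representations named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the vowel set (a, e, i, o, u in both cases, with 'y' treated as a consonant), the mask character `#`, and the function names are all assumptions made here for illustration.

```python
# Illustrative sketch only; not the paper's code.
# Assumption: vowels are a, e, i, o, u (both cases); 'y' counts as a consonant.
VOWELS = frozenset("aeiouAEIOU")

# Assumption: '#' stands in for the single special mask symbol.
MASK = "#"


def consonants_only(text: str) -> str:
    """Drop every vowel, keeping consonants, spaces, and punctuation."""
    return "".join(ch for ch in text if ch not in VOWELS)


def masked_vowels(text: str, mask: str = MASK) -> str:
    """Replace every vowel with one shared mask symbol, preserving length."""
    return "".join(mask if ch in VOWELS else ch for ch in text)


sample = "natural language processing"
print(consonants_only(sample))  # ntrl lngg prcssng
print(masked_vowels(sample))    # n#t#r#l l#ng##g# pr#c#ss#ng
```

Under this sketch, 'consonants-only' shortens the text, while 'masked-vowels' keeps its length and vowel positions but collapses five vowel symbols into one. Words made up solely of vowels (such as "a" or "I") become empty strings in the first variant; how such cases should be handled is not stated in the abstract.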
Anthology ID:
2023.findings-emnlp.775
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2023
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
11578–11588
URL:
https://aclanthology.org/2023.findings-emnlp.775
DOI:
10.18653/v1/2023.findings-emnlp.775
Cite (ACL):
Maged Al-shaibani and Irfan Ahmad. 2023. Consonant is all you need: a compact representation of English text for efficient NLP. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 11578–11588, Singapore. Association for Computational Linguistics.
Cite (Informal):
Consonant is all you need: a compact representation of English text for efficient NLP (Al-shaibani & Ahmad, Findings 2023)
PDF:
https://aclanthology.org/2023.findings-emnlp.775.pdf