Robust Text Classification using Sub-Word Information in Input Word Representations.

Bhanu Prakash Mahanti, Priyank Chhipa, Vivek Sridhar, Vinuthkumar Prasan


Abstract
Word-based deep learning approaches have recently been used with increasing success to solve Natural Language Processing problems such as Machine Translation, Language Modelling and Text Classification. However, the performance of these word-based models is limited by the vocabulary of the training corpus. Alternative approaches using character-based models have been proposed to overcome the unseen-word problem, which arises for a variety of reasons. However, character-based models fail to capture the sequential relationship of words inherently present in texts. Hence, there is scope for improvement by addressing the unseen-word problem while also maintaining the sequential context through word-based models. In this work, we propose a method where the input embedding vector incorporates sub-word information but is also suitable for use with models that successfully capture the sequential nature of text. We further attempt to establish that using such a word representation as input makes the model robust to unseen words, particularly those arising from tokenization and spelling errors, a common problem in systems where a typing interface is one of the input modalities.
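The abstract does not spell out how the sub-word information is encoded, so the following is only a minimal sketch of one common way to build a fixed-length word vector from character n-grams (hashed n-gram counts, in the spirit of fastText), not the authors' method. The function names, the n-gram range (3 to 5), the 256-bucket size and the use of CRC32 as the hash are all assumptions made for illustration.

```python
import math
import zlib


def char_ngrams(word, n_min=3, n_max=5):
    """Return the character n-grams of a word padded with boundary markers."""
    padded = f"<{word}>"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


def subword_vector(word, dim=256):
    """Hypothetical helper: count hashed character n-grams into `dim` buckets.

    A trained model would look up a learned embedding per bucket; raw counts
    are used here only to keep the sketch self-contained.
    """
    vec = [0.0] * dim
    for gram in char_ngrams(word.lower()):
        bucket = zlib.crc32(gram.encode("utf-8")) % dim
        vec[bucket] += 1.0
    return vec


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


if __name__ == "__main__":
    reference = subword_vector("classification")
    misspelled = subword_vector("classifcation")  # spelling error, unseen word
    unrelated = subword_vector("weather")
    print("misspelled:", round(cosine(reference, misspelled), 3))
    print("unrelated: ", round(cosine(reference, unrelated), 3))
```

Because a misspelled word shares most of its character n-grams with the intended word, its vector stays close to the original, whereas a purely word-level vocabulary lookup would map it to an unknown token. Such per-word vectors can still be fed, one per token, into a sequence model, which is the property the abstract emphasises.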
Anthology ID:
2019.icon-1.1
Volume:
Proceedings of the 16th International Conference on Natural Language Processing
Month:
December
Year:
2019
Address:
International Institute of Information Technology, Hyderabad, India
Editors:
Dipti Misra Sharma, Pushpak Bhattacharya
Venue:
ICON
Publisher:
NLP Association of India
Pages:
1–8
URL:
https://aclanthology.org/2019.icon-1.1
Cite (ACL):
Bhanu Prakash Mahanti, Priyank Chhipa, Vivek Sridhar, and Vinuthkumar Prasan. 2019. Robust Text Classification using Sub-Word Information in Input Word Representations. In Proceedings of the 16th International Conference on Natural Language Processing, pages 1–8, International Institute of Information Technology, Hyderabad, India. NLP Association of India.
Cite (Informal):
Robust Text Classification using Sub-Word Information in Input Word Representations. (Mahanti et al., ICON 2019)
PDF:
https://aclanthology.org/2019.icon-1.1.pdf
Data
AG News