Normalization of Indonesian-English Code-Mixed Twitter Data

Anab Maulana Barik, Rahmad Mahendra, Mirna Adriani


Abstract
Twitter is an excellent source of data for NLP research, as it offers a tremendous amount of textual data. However, processing tweets to extract meaningful information is very challenging, for at least two reasons: (i) the use of nonstandard words and an informal writing style, and (ii) code-mixing, i.e., the combination of multiple languages in a single tweet. Most previous work has addressed these two issues as separate, isolated tasks. In this study, we work on the normalization task for code-mixed Twitter data, specifically Indonesian-English. We propose a pipeline that consists of four modules: tokenization, language identification, lexical normalization, and translation. Another contribution is a gold standard of Indonesian-English code-mixed data for each module.
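To make the four-module pipeline concrete, the following is a minimal sketch of how the stages could be chained. It is not the authors' implementation: every function name and lexicon entry is hypothetical toy data, each stage is a simple dictionary lookup rather than a trained model, and the translation direction (English into Indonesian) is an assumption that the abstract does not state.

```python
import re

# Hypothetical toy lexicons for illustration only; not the paper's resources.
LANG_LEXICON = {
    "id": {"aku", "suka", "bgt", "banget"},
    "en": {"this", "movie", "u", "you"},
}
NORMALIZATION = {
    "id": {"bgt": "banget"},
    "en": {"u": "you"},
}
# Assumed translation direction: English into Indonesian.
EN_TO_ID = {"this": "ini", "movie": "film", "you": "kamu"}


def tokenize(tweet):
    """Module 1: lowercase and split into word tokens and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", tweet.lower())


def identify_language(token):
    """Module 2: tag a token as 'id', 'en', or 'un' (unknown) by lexicon lookup."""
    for lang, words in LANG_LEXICON.items():
        if token in words:
            return lang
    return "un"


def normalize(token, lang):
    """Module 3: map a nonstandard token to its standard form in its own language."""
    return NORMALIZATION.get(lang, {}).get(token, token)


def translate(token, lang):
    """Module 4: word-by-word translation of English tokens; others pass through."""
    if lang == "en":
        return EN_TO_ID.get(token, token)
    return token


def run_pipeline(tweet):
    """Chain the four modules in the order listed in the abstract."""
    tagged = [(tok, identify_language(tok)) for tok in tokenize(tweet)]
    normalized = [(normalize(tok, lang), lang) for tok, lang in tagged]
    return [translate(tok, lang) for tok, lang in normalized]


if __name__ == "__main__":
    print(run_pipeline("Aku suka bgt this movie"))
```

The sketch runs language identification before normalization so that each token is normalized against the lexicon of its own language. The word-by-word translation step ignores word order and context, so it only illustrates where translation sits in the chain.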
Anthology ID:
D19-5554
Volume:
Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)
Month:
November
Year:
2019
Address:
Hong Kong, China
Editors:
Wei Xu, Alan Ritter, Tim Baldwin, Afshin Rahimi
Venue:
WNUT
Publisher:
Association for Computational Linguistics
Pages:
417–424
URL:
https://aclanthology.org/D19-5554
DOI:
10.18653/v1/D19-5554
Cite (ACL):
Anab Maulana Barik, Rahmad Mahendra, and Mirna Adriani. 2019. Normalization of Indonesian-English Code-Mixed Twitter Data. In Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019), pages 417–424, Hong Kong, China. Association for Computational Linguistics.
Cite (Informal):
Normalization of Indonesian-English Code-Mixed Twitter Data (Barik et al., WNUT 2019)
PDF:
https://aclanthology.org/D19-5554.pdf