Multilingual Sequence Labeling Approach to solve Lexical Normalization
Divesh Kubal | Apurva Nagvenkar
Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)
The task of converting a nonstandard text to a standard and readable text is known as lexical normalization. Almost all the Natural Language Processing (NLP) applications require the text data in normalized form to build quality task-specific models. Hence, lexical normalization has been proven to improve the performance of numerous natural language processing tasks on social media. This study aims to solve the problem of Lexical Normalization by formulating the Lexical Normalization task as a Sequence Labeling problem. This paper proposes a sequence labeling approach to solve the problem of Lexical Normalization in combination with the word-alignment technique. The goal is to use a single model to normalize text in various languages namely Croatian, Danish, Dutch, English, Indonesian-English, German, Italian, Serbian, Slovenian, Spanish, Turkish, and Turkish-German. This is a shared task in “2021 The 7th Workshop on Noisy User-generated Text (W-NUT)” in which the participants are expected to create a system/model that performs lexical normalization, which is the translation of non-canonical texts into their canonical equivalents, comprising data from over 12 languages. The proposed single multilingual model achieves an overall ERR score of 43.75 on intrinsic evaluation and an overall Labeled Attachment Score (LAS) score of 63.12 on extrinsic evaluation. Further, the proposed method achieves the highest Error Reduction Rate (ERR) score of 61.33 among the participants in the shared task. This study highlights the effects of using additional training data to get better results as well as using a pre-trained Language model trained on multiple languages rather than only on one language.