Multilingual Sequence Labeling Approach to solve Lexical Normalization

Divesh Kubal; Apurva Nagvenkar

doi:10.18653/v1/2021.wnut-1.51

Abstract

Lexical normalization is the task of converting nonstandard text into standard, readable text. Most Natural Language Processing (NLP) applications need text in normalized form to build quality task-specific models, and lexical normalization has been shown to improve the performance of numerous NLP tasks on social media. This paper formulates lexical normalization as a sequence labeling problem and solves it in combination with a word-alignment technique. The goal is to use a single model to normalize text in 12 languages and language pairs: Croatian, Danish, Dutch, English, Indonesian-English, German, Italian, Serbian, Slovenian, Spanish, Turkish, and Turkish-German. The work addresses the shared task of the 7th Workshop on Noisy User-generated Text (W-NUT 2021), in which participants are expected to create a system that performs lexical normalization, the translation of non-canonical texts into their canonical equivalents, on data from these 12 languages. The proposed single multilingual model achieves an overall Error Reduction Rate (ERR) of 43.75 on intrinsic evaluation and an overall Labeled Attachment Score (LAS) of 63.12 on extrinsic evaluation. Further, the proposed method achieves the highest ERR score of 61.33 among the participants in the shared task. The study highlights the benefit of using additional training data and of using a pre-trained language model trained on multiple languages rather than on only one.
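The abstract reports ERR without defining it. As a rough illustration, the sketch below assumes the usual word-level definition for lexical normalization: system accuracy normalized against a baseline that leaves every token unchanged. The function names and the toy sentence are hypothetical and are not taken from the paper or from the shared-task scorer.

```python
def accuracy(predicted, gold):
    """Fraction of tokens whose predicted form equals the gold form."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must be aligned token by token")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def error_reduction_rate(raw, predicted, gold):
    """ERR under the assumed definition:
    (system accuracy - baseline accuracy) / (1 - baseline accuracy),
    where the baseline copies the raw input unchanged."""
    baseline = accuracy(raw, gold)
    system = accuracy(predicted, gold)
    if baseline == 1.0:
        raise ValueError("no token needs normalization; ERR is undefined")
    return (system - baseline) / (1.0 - baseline)


# Toy example (hypothetical data): three of five tokens need normalization.
raw = ["u", "r", "new", "here", "rite"]
gold = ["you", "are", "new", "here", "right"]
pred = ["you", "are", "new", "here", "rite"]

err = error_reduction_rate(raw, pred, gold)
print(f"ERR = {100 * err:.2f}")
```

Under this definition a system that copies its input scores 0, a perfect system scores 1, and a system that introduces more errors than it fixes scores below 0. Scores such as 43.75 correspond to the value multiplied by 100.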

Anthology ID: 2021.wnut-1.51
Volume: Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)
Month: November
Year: 2021
Address: Online
Editors: Wei Xu, Alan Ritter, Tim Baldwin, Afshin Rahimi
Venue: WNUT
Publisher: Association for Computational Linguistics
Pages: 457–464
URL: https://aclanthology.org/2021.wnut-1.51
DOI: 10.18653/v1/2021.wnut-1.51
Cite (ACL): Divesh Kubal and Apurva Nagvenkar. 2021. Multilingual Sequence Labeling Approach to solve Lexical Normalization. In Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021), pages 457–464, Online. Association for Computational Linguistics.
Cite (Informal): Multilingual Sequence Labeling Approach to solve Lexical Normalization (Kubal & Nagvenkar, WNUT 2021)
PDF: https://aclanthology.org/2021.wnut-1.51.pdf
