ViLexNorm: A Lexical Normalization Corpus for Vietnamese Social Media Text

Thanh-Nhi Nguyen, Thanh-Phong Le, Kiet Nguyen


Abstract
Lexical normalization, a fundamental task in Natural Language Processing (NLP), involves transforming words into their canonical forms. This process has been shown to greatly benefit various downstream NLP tasks. In this work, we introduce Vietnamese Lexical Normalization (ViLexNorm), the first corpus developed for the Vietnamese lexical normalization task. The corpus comprises over 10,000 sentence pairs meticulously annotated by human annotators, sourced from public comments on Vietnam’s most popular social media platforms. We evaluated our corpus with various methods, and the best-performing system achieved 57.74% on the Error Reduction Rate (ERR) metric (van der Goot, 2019a) against the Leave-As-Is (LAI) baseline. For extrinsic evaluation, applying a model trained on ViLexNorm demonstrates the positive impact of the Vietnamese lexical normalization task on other NLP tasks. Our corpus is publicly available exclusively for research purposes.
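For context, ERR (van der Goot, 2019a) is word-level accuracy normalized against the Leave-As-Is baseline, which copies every input token unchanged. Below is a minimal Python sketch of this metric, assuming parallel token lists of noisy inputs, gold normalizations, and system predictions; the Vietnamese tokens in the toy example are hypothetical illustrations, not drawn from the corpus.

```python
def error_reduction_rate(inputs, golds, predictions):
    """ERR: word-level accuracy normalized over the Leave-As-Is (LAI) baseline.

    inputs      -- original (noisy) tokens
    golds       -- gold normalized tokens
    predictions -- system output tokens
    """
    assert len(inputs) == len(golds) == len(predictions)
    total = len(golds)
    # System accuracy: fraction of tokens the system normalized correctly.
    acc_sys = sum(p == g for p, g in zip(predictions, golds)) / total
    # LAI baseline accuracy: fraction of tokens already in canonical form.
    acc_lai = sum(i == g for i, g in zip(inputs, golds)) / total
    # 1.0 = all tokens correct; 0.0 = no better than LAI; negative = worse.
    return (acc_sys - acc_lai) / (1.0 - acc_lai)

# Toy example (hypothetical): "mik" -> "mình" (me), "ko" -> "không" (not).
inputs = ["mik", "ko", "biết", "."]
golds  = ["mình", "không", "biết", "."]
preds  = ["mình", "ko", "biết", "."]   # fixes one of the two noisy tokens
print(f"ERR = {error_reduction_rate(inputs, golds, preds):.2%}")  # 50.00%
```

Under this definition, 100% ERR means every token is normalized correctly, 0% is no improvement over leaving the text as-is, and negative values mean the system corrupts more tokens than it fixes.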
Anthology ID:
2024.eacl-long.85
Volume:
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
March
Year:
2024
Address:
St. Julian’s, Malta
Editors:
Yvette Graham, Matthew Purver
Venue:
EACL
Publisher:
Association for Computational Linguistics
Pages:
1421–1437
URL:
https://aclanthology.org/2024.eacl-long.85
Cite (ACL):
Thanh-Nhi Nguyen, Thanh-Phong Le, and Kiet Nguyen. 2024. ViLexNorm: A Lexical Normalization Corpus for Vietnamese Social Media Text. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1421–1437, St. Julian’s, Malta. Association for Computational Linguistics.
Cite (Informal):
ViLexNorm: A Lexical Normalization Corpus for Vietnamese Social Media Text (Nguyen et al., EACL 2024)
PDF:
https://aclanthology.org/2024.eacl-long.85.pdf
Note:
 2024.eacl-long.85.note.zip