CL-MoNoise: Cross-lingual Lexical Normalization

Rob van der Goot


Abstract
Social media text is notoriously difficult for existing natural language processing tools to handle, because of spelling errors, non-standard words, shortenings, and non-standard capitalization and punctuation. One way to circumvent these issues is to normalize the input data before processing. Most previous work has focused on a single language, usually English. In this paper, we are the first to propose a model for cross-lingual normalization, with which we participate in the WNUT 2021 shared task. To this end, we use MoNoise as a starting point and make a simple adaptation for cross-lingual application. Our proposed model outperforms the leave-as-is baseline provided by the organizers, which simply copies the input. Furthermore, we explore a completely different model that recasts the task as sequence labeling. Performance of this second system is low, as our implementation does not take capitalization into account.
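The abstract mentions recasting lexical normalization as a sequence labeling task. A minimal sketch of that framing is shown below; the function names and tag scheme are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch (not the paper's implementation): lexical
# normalization cast as token-level sequence labeling. Each noisy token
# receives a label that is either "KEEP" or its normalized replacement;
# predicting labels then rewrites the sentence token by token.

def labels_from_alignment(noisy, normalized):
    """Derive per-token labels from a 1-to-1 token alignment."""
    assert len(noisy) == len(normalized)
    return ["KEEP" if n == g else g for n, g in zip(noisy, normalized)]

def apply_labels(noisy, labels):
    """Rewrite a noisy sentence using predicted labels."""
    return [tok if lab == "KEEP" else lab for tok, lab in zip(noisy, labels)]

# Hypothetical example: "u r gr8" normalized to "you are great".
noisy = ["u", "r", "gr8"]
gold = ["you", "are", "great"]
labels = labels_from_alignment(noisy, gold)
print(labels)                        # ['you', 'are', 'great']
print(apply_labels(noisy, labels))   # ['you', 'are', 'great']
```

Note that if the label vocabulary is built from lowercased tokens, the rewritten output loses capitalization, which mirrors the limitation the abstract reports for the sequence labeling system.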
Anthology ID:
2021.wnut-1.56
Volume:
Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)
Month:
November
Year:
2021
Address:
Online
Editors:
Wei Xu, Alan Ritter, Tim Baldwin, Afshin Rahimi
Venue:
WNUT
Publisher:
Association for Computational Linguistics
Pages:
510–514
URL:
https://aclanthology.org/2021.wnut-1.56
DOI:
10.18653/v1/2021.wnut-1.56
Bibkey:
Cite (ACL):
Rob van der Goot. 2021. CL-MoNoise: Cross-lingual Lexical Normalization. In Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021), pages 510–514, Online. Association for Computational Linguistics.
Cite (Informal):
CL-MoNoise: Cross-lingual Lexical Normalization (van der Goot, WNUT 2021)
PDF:
https://aclanthology.org/2021.wnut-1.56.pdf