ÚFAL at MultiLexNorm 2021: Improving Multilingual Lexical Normalization by Fine-tuning ByT5

David Samuel, Milan Straka


Abstract
We present the winning entry to the Multilingual Lexical Normalization (MultiLexNorm) shared task at W-NUT 2021 (van der Goot et al., 2021a), which evaluates lexical-normalization systems on 12 social media datasets in 11 languages. We base our solution on a pre-trained byte-level language model, ByT5 (Xue et al., 2021a), which we further pre-train on synthetic data and then fine-tune on authentic normalization data. Our system achieves the best performance by a wide margin in intrinsic evaluation, and also the best performance in extrinsic evaluation through dependency parsing. The source code is released at https://github.com/ufal/multilexnorm2021 and the fine-tuned models at https://huggingface.co/ufal.
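As a rough illustration of what "byte-level" means for a model such as ByT5, the sketch below maps text to one id per UTF-8 byte instead of looking up subword tokens. It is not taken from the released source code. The helper names `encode_bytes` and `decode_bytes` are hypothetical, and the id layout (three reserved ids for padding, end-of-sequence and unknown, followed by the 256 byte values) is an assumption based on the publicly documented ByT5 vocabulary.

```python
from typing import List

# Illustrative sketch only, not the authors' implementation.
# Assumption: ids 0, 1, 2 are reserved for <pad>, </s>, <unk>,
# so each UTF-8 byte b is mapped to id b + 3.
SPECIAL_OFFSET = 3   # number of reserved special-token ids (assumed)
EOS_ID = 1           # end-of-sequence id (assumed)
NUM_BYTES = 256      # number of distinct byte values


def encode_bytes(text: str) -> List[int]:
    """Map text to byte-level ids and append the end-of-sequence id."""
    return [b + SPECIAL_OFFSET for b in text.encode("utf-8")] + [EOS_ID]


def decode_bytes(ids: List[int]) -> str:
    """Invert encode_bytes, skipping ids outside the byte range."""
    raw = bytes(
        i - SPECIAL_OFFSET
        for i in ids
        if SPECIAL_OFFSET <= i < SPECIAL_OFFSET + NUM_BYTES
    )
    return raw.decode("utf-8", errors="ignore")


if __name__ == "__main__":
    noisy = "c u 2moro"          # made-up noisy input, for illustration
    ids = encode_bytes(noisy)
    print(len(ids), ids)         # one id per byte, plus the end marker
    print(decode_bytes(ids))     # round-trips back to the input text
```

Because every string is representable as bytes, misspellings and non-standard spellings common in social media text never fall outside the vocabulary, which is one plausible reason a byte-level model suits lexical normalization.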
Anthology ID:
2021.wnut-1.54
Volume:
Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)
Month:
November
Year:
2021
Address:
Online
Venues:
EMNLP | WNUT
Publisher:
Association for Computational Linguistics
Pages:
483–492
URL:
https://aclanthology.org/2021.wnut-1.54
DOI:
10.18653/v1/2021.wnut-1.54
PDF:
https://aclanthology.org/2021.wnut-1.54.pdf
Code:
ufal/multilexnorm2021