Optimizing the Size of Subword Vocabularies in Dialect Classification

Vani Kanjirangat, Tanja Samardžić, Ljiljana Dolamic, Fabio Rinaldi


Abstract
Pre-trained models usually come with a pre-defined tokenization and little flexibility as to what subword tokens can be used in downstream tasks. This problem especially concerns multilingual NLP and low-resource languages, which are typically processed using cross-lingual transfer. In this paper, we aim to find out whether the right granularity of tokenization is helpful for a text classification task, namely dialect classification. Aiming at generalizations beyond the studied cases, we look for the optimal granularity in four dialect datasets, two with relatively consistent writing (one Arabic and one Indo-Aryan set) and two with considerably inconsistent writing (one Arabic and one Swiss German set). To gain more control over subword tokenization and ensure direct comparability across experimental settings, we train a CNN classifier from scratch, comparing two subword tokenization methods (the Unigram model and BPE). For reference, we compare the results obtained in our analysis to the state of the art achieved by fine-tuning pre-trained models. We show that models trained from scratch with an optimal tokenization level perform better than fine-tuned classifiers in the case of highly inconsistent writing. In the case of relatively consistent writing, fine-tuned models remain better regardless of the tokenization level.
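The abstract treats vocabulary size as the tunable granularity of subword tokenization. As an illustration only (this is not the authors' implementation, which in practice would use a library such as SentencePiece), here is a minimal from-scratch BPE training sketch showing how the target vocabulary size controls how coarse the learned subwords become:

```python
from collections import Counter

def train_bpe(corpus, vocab_size):
    """Minimal BPE training sketch: start from characters and greedily
    merge the most frequent adjacent token pair until the vocabulary
    reaches vocab_size (or no pairs remain)."""
    # Word frequencies, with each word represented as a tuple of tokens.
    words = Counter(tuple(w) for line in corpus for w in line.split())
    vocab = {ch for w in words for ch in w}
    merges = []
    while len(vocab) < vocab_size:
        # Count adjacent token pairs, weighted by word frequency.
        pairs = Counter()
        for w, freq in words.items():
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # nothing left to merge
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        vocab.add(a + b)
        # Apply the chosen merge to every word.
        merged = Counter()
        for w, freq in words.items():
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == (a, b):
                    out.append(a + b)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            merged[tuple(out)] += freq
        words = merged
    return vocab, merges

# Toy corpus (hypothetical): a larger vocab_size yields coarser,
# longer subword units; a smaller one stays closer to characters.
corpus = ["low lower lowest"] * 5
vocab, merges = train_bpe(corpus, vocab_size=10)
print(sorted(vocab))
print(merges)
```

Searching over `vocab_size` values in this spirit is the "optimal granularity" knob the paper studies; the Unigram model differs in that it prunes a large initial vocabulary by likelihood rather than building one up by merges.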
Anthology ID:
2023.vardial-1.2
Volume:
Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)
Month:
May
Year:
2023
Address:
Dubrovnik, Croatia
Editors:
Yves Scherrer, Tommi Jauhiainen, Nikola Ljubešić, Preslav Nakov, Jörg Tiedemann, Marcos Zampieri
Venue:
VarDial
Publisher:
Association for Computational Linguistics
Pages:
14–30
URL:
https://aclanthology.org/2023.vardial-1.2
DOI:
10.18653/v1/2023.vardial-1.2
Cite (ACL):
Vani Kanjirangat, Tanja Samardžić, Ljiljana Dolamic, and Fabio Rinaldi. 2023. Optimizing the Size of Subword Vocabularies in Dialect Classification. In Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023), pages 14–30, Dubrovnik, Croatia. Association for Computational Linguistics.
Cite (Informal):
Optimizing the Size of Subword Vocabularies in Dialect Classification (Kanjirangat et al., VarDial 2023)
PDF:
https://aclanthology.org/2023.vardial-1.2.pdf
Video:
https://aclanthology.org/2023.vardial-1.2.mp4