Dialect and Variant Identification as a Multi-Label Classification Task: A Proposal Based on Near-Duplicate Analysis

Gabriel Bernier-Colborne; Cyril Goutte; Serge Léger

doi:10.18653/v1/2023.vardial-1.15

Abstract

We argue that dialect identification should be treated as a multi-label classification problem rather than the single-class setting prevalent in existing collections and evaluations. To avoid extensive human re-labelling of the data, we propose an analysis of ambiguous near-duplicates in an existing collection covering four variants of French. We show how this analysis helps us provide multiple labels for a significant subset of the original data, thereby enriching the annotation with minimal human intervention. The resulting data can then be used to train dialect identifiers in a multi-label setting. Experimental results show that on the enriched dataset, the multi-label classifier achieves accuracy similar to that of the single-label classifier on unambiguous (single-label) test cases, but it increases the macro-averaged F1-score by 0.225 absolute (71% relative gain) on ambiguous texts with multiple labels. On the original data, gains on the ambiguous test cases are smaller but still considerable (+0.077 absolute, 20% relative gain), and accuracy on non-ambiguous test cases is again similar. This supports our thesis that modelling dialect identification as a multi-label problem potentially has a positive impact.
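
The idea of deriving multiple labels from near-duplicates can be illustrated with a minimal Python sketch. It is not the authors' procedure: the normalization rule (lowercasing, dropping digits and punctuation), the function names normalize and merge_labels, the label codes and the toy sentences are all assumptions made for illustration. Texts that collapse to the same normalized form are treated as near-duplicates, and each one receives the union of the labels seen in its group.

```python
import re
from collections import defaultdict


def normalize(text: str) -> str:
    """Crude near-duplicate key (an assumption, not the paper's method):
    lowercase, replace runs of digits/punctuation/whitespace with one space."""
    text = text.lower()
    text = re.sub(r"[\W\d_]+", " ", text)
    return text.strip()


def merge_labels(samples):
    """Give every text the union of labels found among its near-duplicates.

    samples: list of (text, single_label) pairs.
    Returns a list of (text, sorted list of labels) in the same order.
    """
    labels_by_key = defaultdict(set)
    for text, label in samples:
        labels_by_key[normalize(text)].add(label)
    return [(text, sorted(labels_by_key[normalize(text)])) for text, _ in samples]


# Hypothetical single-label data; the label codes are illustrative only.
samples = [
    ("Le gouvernement a annoncé 3 mesures.", "FR"),
    ("Le gouvernement a annoncé 5 mesures !", "BE"),
    ("Il fait frette à Montréal.", "CA"),
]

multi = merge_labels(samples)
for text, labels in multi:
    print(labels, text)
```

In this toy data the first two sentences differ only in a digit and punctuation, so both end up with two labels, while the third keeps its single label. A real pipeline would need a more careful similarity criterion than exact match after normalization.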

Anthology ID: 2023.vardial-1.15
Volume: Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)
Month: May
Year: 2023
Address: Dubrovnik, Croatia
Editors: Yves Scherrer, Tommi Jauhiainen, Nikola Ljubešić, Preslav Nakov, Jörg Tiedemann, Marcos Zampieri
Venue: VarDial
Publisher: Association for Computational Linguistics
Pages: 142–151
URL: https://aclanthology.org/2023.vardial-1.15/
DOI: 10.18653/v1/2023.vardial-1.15
Cite (ACL): Gabriel Bernier-Colborne, Cyril Goutte, and Serge Léger. 2023. Dialect and Variant Identification as a Multi-Label Classification Task: A Proposal Based on Near-Duplicate Analysis. In Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023), pages 142–151, Dubrovnik, Croatia. Association for Computational Linguistics.
Cite (Informal): Dialect and Variant Identification as a Multi-Label Classification Task: A Proposal Based on Near-Duplicate Analysis (Bernier-Colborne et al., VarDial 2023)
PDF: https://aclanthology.org/2023.vardial-1.15.pdf
Video: https://aclanthology.org/2023.vardial-1.15.mp4
