2023
Dialect and Variant Identification as a Multi-Label Classification Task: A Proposal Based on Near-Duplicate Analysis
Gabriel Bernier-Colborne | Cyril Goutte | Serge Leger
Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)
We argue that dialect identification should be treated as a multi-label classification problem rather than the single-class setting prevalent in existing collections and evaluations. In order to avoid extensive human re-labelling of the data, we propose an analysis of ambiguous near-duplicates in an existing collection covering four variants of French. We show how this analysis helps us provide multiple labels for a significant subset of the original data, therefore enriching the annotation with minimal human intervention. The resulting data can then be used to train dialect identifiers in a multi-label setting. Experimental results show that on the enriched dataset, the multi-label classifier produces similar accuracy to the single-label classifier on test cases that are unambiguous (single label), but it increases the macro-averaged F1-score by 0.225 absolute (71% relative gain) on ambiguous texts with multiple labels. On the original data, gains on the ambiguous test cases are smaller but still considerable (+0.077 absolute, 20% relative gain), and accuracy on non-ambiguous test cases is again similar in this case. This supports our thesis that modelling dialect identification as a multi-label problem potentially has a positive impact.
2022
Transfer Learning Improves French Cross-Domain Dialect Identification: NRC @ VarDial 2022
Gabriel Bernier-Colborne | Serge Leger | Cyril Goutte
Proceedings of the Ninth Workshop on NLP for Similar Languages, Varieties and Dialects
We describe the systems developed by the National Research Council Canada for the French Cross-Domain Dialect Identification shared task at the 2022 VarDial evaluation campaign. We evaluated two different approaches to this task: SVM and probabilistic classifiers exploiting n-grams as features, and trained from scratch on the data provided; and a pre-trained French language model, CamemBERT, that we fine-tuned on the dialect identification task. The latter method turned out to improve the macro-F1 score on the test set from 0.344 to 0.430 (25% increase), which indicates that transfer learning can be helpful for dialect identification.
2021
N-gram and Neural Models for Uralic Language Identification: NRC at VarDial 2021
Gabriel Bernier-Colborne | Serge Leger | Cyril Goutte
Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects
We describe the systems developed by the National Research Council Canada for the Uralic language identification shared task at the 2021 VarDial evaluation campaign. We evaluated two different approaches to this task: a probabilistic classifier exploiting only character 5-grams as features, and a character-based neural network pre-trained through self-supervision, then fine-tuned on the language identification task. The former method turned out to perform better, which casts doubt on the usefulness of deep learning methods for language identification, where they have yet to convincingly and consistently outperform simpler and less costly classification algorithms exploiting n-gram features.
2019
Improving Cuneiform Language Identification with BERT
Gabriel Bernier-Colborne | Cyril Goutte | Serge Léger
Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects
We describe the systems developed by the National Research Council Canada for the Cuneiform Language Identification (CLI) shared task at the 2019 VarDial evaluation campaign. We compare a state-of-the-art baseline relying on character n-grams and a traditional statistical classifier, a voting ensemble of classifiers, and a deep learning approach using a Transformer network. We describe how these systems were trained, and analyze the impact of some preprocessing and model estimation decisions. The deep neural network achieved 77% accuracy on the test data, which turned out to be the best performance at the CLI evaluation, establishing a new state-of-the-art for cuneiform language identification.
2017
Exploring Optimal Voting in Native Language Identification
Cyril Goutte | Serge Léger
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications
We describe the submissions entered by the National Research Council Canada in the NLI-2017 evaluation. We mainly explored the use of voting, and various ways to optimize the choice and number of voting systems. We also explored the use of features that rely on no linguistic preprocessing. Long ngrams of characters obtained from raw text turned out to yield the best performance on all textual input (written essays and speech transcripts). Voting ensembles turned out to produce small performance gains, with little difference between the various optimization strategies we tried. Our top systems achieved accuracies of 87% on the essay track, 84% on the speech track, and close to 92% by combining essays, speech and i-vectors in the fusion track.
2016
Discriminating Similar Languages: Evaluations and Explorations
Cyril Goutte | Serge Léger | Shervin Malmasi | Marcos Zampieri
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
We present an analysis of the performance of machine learning classifiers on discriminating between similar languages and language varieties. We carried out a number of experiments using the results of the two editions of the Discriminating between Similar Languages (DSL) shared task. We investigate the progress made between the two tasks, estimate an upper bound on possible performance using ensemble and oracle combination, and provide learning curves to help us understand which languages are more challenging. A number of difficult sentences are identified and investigated further with human annotation.
Advances in Ngram-based Discrimination of Similar Languages
Cyril Goutte | Serge Léger
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)
We describe the systems entered by the National Research Council in the 2016 shared task on discriminating similar languages. Like previous years, we relied on character ngram features, and a mixture of discriminative and generative statistical classifiers. We mostly investigated the influence of the amount of data on the performance, in the open task, and compared the two-stage approach (predicting language/group, then variant) to a flat approach. Results suggest that ngrams are still state-of-the-art for language and variant identification, and that additional data has a small but decisive impact.
2015
Towards Automatic Description of Knowledge Components
Cyril Goutte | Guillaume Durand | Serge Léger
Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications
Experiments in Discriminating Similar Languages
Cyril Goutte | Serge Léger
Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects
2014
The NRC System for Discriminating Similar Languages
Cyril Goutte | Serge Léger | Marine Carpuat
Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects
2013
Feature Space Selection and Combination for Native Language Identification
Cyril Goutte | Serge Léger | Marine Carpuat
Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications