Early Guessing for Dialect Identification

This paper deals with the problem of incremental dialect identification. Our goal is to reliably determine the dialect before the full utterance is given as input. Most previous research on dialect identification has been model-centric, focusing on performance. We address a new question: How much input is needed to identify a dialect? Our approach is a data-centric analysis that results in general criteria for finding the shortest input needed to make a plausible guess. Working with three sets of language dialects (Swiss German, Indo-Aryan and Arabic), we show that it is possible to generalize across dialects and datasets with two input shortening criteria: model confidence and minimal input length (adjusted for the input type). The source code for the experimental analysis is available on GitHub.


Introduction
Language identification depends heavily on which languages are being discriminated. If the languages to be discriminated are distant (e.g., Russian vs. Chinese), the task is straightforward and a short sequence of words provides enough information to assign the correct class. But if the languages are similar and written in the same script (e.g., Russian vs. Ukrainian), much longer samples are needed to encounter the discriminating features (Tiedemann and Ljubešić, 2012). The task is even harder when dealing with non-standard orthography, which we find in written dialects and user posts on the internet (Zampieri et al., 2017).
Current research is mostly concerned with improving performance on the task by applying increasingly sophisticated methods, including pretrained models, whose performance varies across language dialects (Jauhiainen et al., 2021).
However, many other aspects of the task may play an important role in practical applications. One such challenge is the possibility of making early guesses about the language or dialect before seeing the whole message. Such a feature can be especially useful for more dynamic classification of a continuous stream of messages, as in transcribed speech or instant messaging, where sentence boundaries are not clearly defined and the input can be segmented arbitrarily. A reliable early classification can improve further processing (e.g., translation, information extraction).
In this paper, we address the problem of early guessing in dialect identification mostly from a data-centric point of view, but we also consider some model-centric issues. We search for criteria for shortening the input such that the model performance is the same as, or close to, the performance obtained with the full input. We perform experimental studies with four datasets representing three sets of dialects with considerably different writing practices (some are closer to standard orthography than others) and find that the same input shortening criteria give the best results in all cases.
In contrast to most previous work, the main focus is not on improving performance when classifying the whole utterance, but on finding the minimal input on which an acceptable classification performance can be achieved. This aspect of the problem has so far been only minimally addressed in the literature. One such study examined the influence of utterance length in Arabic dialect identification (ADI) on the Basic Traveling Expression Corpus (BTEC) (Takezawa et al., 2007), for coarse- and fine-grained dialect classification using n-gram features with a Multinomial Naive Bayes classifier; it found that, with an average length of 7 words, it is possible to classify the dialects with acceptable accuracy (Salameh et al., 2018). Our study is a more general, data-centric exploration of early classification involving varied datasets in multiple languages.

Data
For our study, we select four datasets, three of which come from the VarDial Evaluation Campaigns, and perform three tasks: German Dialect Identification (GDI), Indo-Aryan Language Identification (ILI), and Arabic Dialect Identification (ADI).

GDI The GDI dataset represents four areas of Swiss German: Basel, Bern, Lucerne, and Zurich. The training and test sets are obtained from the ArchiMob corpus of spoken Swiss German (Samardzic et al., 2016). GDI datasets are available for the years 2017-2019; we work with GDI-2018 in the 4-way classification setting.

ILI The ILI task identifies five closely related languages from the Indo-Aryan language family, namely Hindi, Braj Bhasha, Awadhi, Bhojpuri, and Magahi. For each language, 15,000 sentences are extracted, mainly from the literature domain; the sources were previously published either on the internet or in print. These languages are often mistakenly considered to be varieties of Hindi.

ADI The ADI VarDial task (Malmasi et al., 2016; Ali et al., 2016) covers five classes: Modern Standard Arabic (MSA), Egyptian (EGY), Gulf (GLF), Levantine (LAV), and North African (NOR). MSA is the modern variety of the language used in news and educational articles; it differs lexically, syntactically and phonetically from the varieties native speakers actually use in everyday communication. The VarDial ADI dataset consists of speech transcripts that were additionally transliterated into the Latin script. Another Arabic dialect dataset used in our experiments is the Arabic Online Commentary (AOC) dataset (Zaidan and Callison-Burch, 2011), a large-scale repository of Arabic dialects covering MSA and the dialectal varieties Egyptian (EGY), Gulf (GLF), Levantine (LEV), and Moroccan (MOR).

Table 1 reports the statistics of the GDI, ILI and ADI datasets used in our experiments. We refer to the ADI dataset from VarDial as ADI-VarDial and to AOC as ADI-AOC.

Methods
We perform an incremental analysis by running the same classifier on varied substrings of the test input in order to understand the performance at different input lengths. We start with the first word, then repeat the classification with the first two words, and so on until we reach the end of the utterance. We refer to all these incremental substrings as fragments. We observe the model's performance at each incremental step and analyze its state (confidence) to determine the earliest point at which a plausible guess can be made. We also perform an extensive analysis of the factors that directly and indirectly affect the model performance after applying the shortening criteria.

Models For each dialect group, we compare the base model (pretrained on English) with one or more models pretrained on a language more closely related to the target dialect group. For the GDI dataset, we compared three models: BERT-base-cased (Devlin et al., 2019), multilingual BERT (mBERT), and German BERT. For the ILI dataset, we compared four models: BERT-base-cased, mBERT, IndicTransformers (Jain et al., 2020), and IndicBERT (Kunchukuttan et al., 2020). IndicBERT covers 12 languages, including Hindi, Tamil, English and Malayalam; it is trained on AI4Bharat's corpus and is based on multilingual ALBERT. IndicTransformers is a BERT model trained with 3 GB of data from the OSCAR corpus and covers Hindi, Bengali and Telugu. For ADI-VarDial, we used two main models: BERT-base-cased and AraBERT (Antoun et al., 2020). We also ran experiments with mBERT, but it performed poorly (an accuracy of only 28% with a 7% F-score), so we did not pursue it further for ADI-VarDial. AraBERT is based on the BERT-base model and is additionally pre-trained on Arabic news articles and two publicly available large Arabic corpora covering 24 Arab countries. For the ADI-AOC dataset, we compare three models: BERT-base-cased, mBERT and AraBERT.
Input Shortening We first tokenize the input sentence by splitting it on white space. We then create fragments consisting of incrementally longer prefixes of the original utterance. The fragment length ranges from 1 to N, where N is the length of the original utterance (in tokens). For example, consider the test sentence 'das haisst im klarteggst' with N=4. The incremental fragments are: ['das', 'das haisst', 'das haisst im', 'das haisst im klarteggst'].
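As an illustration, the fragment construction can be implemented in a few lines (a minimal sketch; the function name make_fragments is ours, not from the released code):

```python
def make_fragments(utterance: str) -> list[str]:
    # White-space tokenization, then incrementally longer prefixes.
    tokens = utterance.split()
    return [" ".join(tokens[:i + 1]) for i in range(len(tokens))]

# Example from the GDI test set (N=4 fragments):
print(make_fragments("das haisst im klarteggst"))
# ['das', 'das haisst', 'das haisst im', 'das haisst im klarteggst']
```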
The number of fragments obtained in each case is listed in Appendix A. For each fragment, we obtain predictions using the same fine-tuned model and collect the predicted label and its confidence for further analysis.
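A sketch of how the per-fragment predictions and confidence scores could be collected with a fine-tuned Hugging Face classifier (the model path is a placeholder and the helper name is ours; make_fragments is the sketch above):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_DIR = "path/to/fine-tuned-dialect-model"  # placeholder path
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR)
model.eval()

def predict_fragment(fragment: str) -> tuple[int, float]:
    """Return (predicted label id, confidence) for a single fragment."""
    inputs = tokenizer(fragment, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1).squeeze(0)
    confidence, label = torch.max(probs, dim=-1)
    return label.item(), confidence.item()

# One prediction per incremental fragment of a test utterance.
records = [predict_fragment(f) for f in make_fragments("das haisst im klarteggst")]
```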

Model Confidence Analysis with Temperature Scaling
The confidence scores of a model can be very high (close to 1) even when its predictions are incorrect. Calibration is a way to disincentivize a model from being over-confident (Bella et al., 2010; Nixon et al., 2019; Widmann et al., 2019). Although transformer models are considered to be reasonably well calibrated (Desai and Durrett, 2020), methods such as temperature scaling (Guo et al., 2017) and label smoothing (Müller et al., 2019) can further improve calibration. We expect this to help especially for the GDI data, where the overall performance is rather low compared to the other datasets. We explore temperature scaling to calibrate the prediction probabilities of our model: we divide the non-normalized logits (before the softmax operation) by the scalar temperature hyperparameter T. After this step, the prediction probability is obtained using the usual softmax function. When exploring model confidence as a shortening criterion, we use the calibrated probabilities. The details of the model calibration are explained in Appendix C.
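In code, temperature scaling is a single division of the logits before the softmax (a minimal sketch; in practice T is fitted on held-out validation data, e.g. by minimizing the negative log-likelihood, following Guo et al. (2017)):

```python
import torch
import torch.nn.functional as F

def calibrated_probs(logits: torch.Tensor, temperature: float) -> torch.Tensor:
    # Scale the non-normalized logits by 1/T, then apply the usual softmax.
    # T > 1 softens the distribution and reduces over-confidence;
    # T = 1 recovers the uncalibrated probabilities.
    return F.softmax(logits / temperature, dim=-1)
```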

Experiments and Results
Each model was trained for 4 epochs with the Adam optimizer and a learning rate of 2e-5 on the corresponding training set, using one Tesla K80 GPU. We used the pre-trained models from the HuggingFace library.

BERT-base vs. Linguistic Proximity Table 2 shows the classification accuracy with full input and with shortened input based on the input shortening criteria (explained in Section 5 and Appendix D). We note that the ILI and ADI-AOC datasets are the closest to standard writing, while much more non-standard writing is found in the other two datasets. Linguistic proximity of the pretrained model (i.e., pretraining on a language closely related to the target dialect group) seems useful only in combination with standard writing (ILI and ADI-AOC). In ADI-AOC, AraBERT is more helpful with shortened input than mBERT, although both achieve similar accuracy with full input. The best performance on the strongly non-standard data (GDI and ADI-VarDial) is still obtained with BERT-base-cased. Note, however, that the best performance on these non-standard datasets is rather low in absolute terms, underlining the need for better models in this domain.
Fragment Length The impact of fragment length on the model is analyzed in Figure 1. The solid lines in the plot represent the ratio of correct predictions at each fragment length n to the total number of test instances (which is constant for a given corpus). The dashed lines represent the ratio of correct predictions at fragment length n to the number of possible predictions at that length (i.e., sentences of length n or longer). These graphs show that it is possible to make accurate and useful predictions with shorter fragments, and the additional gain with longer fragments is proportionally lower. A first peak for the GDI data is obtained at length 4, while the peak is at length 7 for the ILI dataset; in ADI-AOC, the peak is at length 2. The trend in all these cases is the same, modulated by the length of the original utterances (longer in ILI). The trends are somewhat different for the ADI-VarDial dataset: there are no clear peaks, but the best scores are still obtained on shorter segments.
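For reference, the two proportions plotted in Figure 1 can be computed as follows (a sketch; `results` is a hypothetical list of (fragment length, gold label, predicted label) records produced by the incremental classification):

```python
from collections import Counter

def correct_prediction_ratios(results, num_test_instances):
    """results: iterable of (fragment_length, gold_label, predicted_label) tuples,
    one per fragment of every test utterance."""
    correct_at_len, total_at_len = Counter(), Counter()
    for length, gold, pred in results:
        total_at_len[length] += 1            # = number of sentences of length >= `length`
        correct_at_len[length] += int(pred == gold)
    solid = {n: correct_at_len[n] / num_test_instances for n in total_at_len}  # solid lines
    dashed = {n: correct_at_len[n] / total_at_len[n] for n in total_at_len}    # dashed lines
    return solid, dashed
```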
Input Shortening Criteria For each set of dialects, we found the minimum input length for the best performance through repeated experiments on different fragment lengths. We found that this minimum fragment length was the same as the peak values in Figure 1; for ADI-VarDial, the maximum performance is obtained at length 8. The maximum input length is then determined according to two additional criteria: model confidence after temperature scaling and label consistency. We experiment with several criteria defined in terms of these two variables (described in detail in Appendix D). We find that the same selection criterion gives the best results on all the datasets: the first decrease in the model confidence after the minimum length threshold. In addition, label consistency was helpful on all the datasets except GDI: the results improve when we consider a decrease in the model confidence only when the current and the previous labels are the same. While analyzing the performance on the shortened input, we also checked what would happen in an ideal case, if we knew where to cut each input utterance. This performance, which is the upper bound for our task, is in fact better than the performance with the full input (details in Appendix B). It provides an empirical justification for the research goal of finding criteria for shortening the input.
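Putting the criteria together, the resulting stopping rule can be sketched as follows (the function name and the choice to return the prediction made just before the confidence drop reflect our reading of the criterion; `m` is the dataset-specific minimum length, e.g. 4 for GDI):

```python
def early_guess(labels, confidences, m, require_same_label=True):
    """labels / confidences: per-fragment predictions for fragments of length 1..N.
    Stop at the first drop in (calibrated) confidence after the minimum length m,
    optionally only when the current and previous predicted labels agree."""
    for i in range(m, len(confidences)):
        confidence_drops = confidences[i] < confidences[i - 1]
        labels_agree = (not require_same_label) or labels[i] == labels[i - 1]
        if confidence_drops and labels_agree:
            return i - 1, labels[i - 1]   # guess on the fragment before the drop
    return len(labels) - 1, labels[-1]    # otherwise fall back to the full utterance
```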

Conclusion
We have identified general criteria for making early guesses in dialect identification: a language-specific minimal input length and a language-independent change in the model confidence score (the first decrease in confidence). While these criteria do not fully maintain the performance achieved with the full input, they are a starting point for further optimization, which can eventually lead to an overall improvement on the task.

Limitations
The main limitation of our work is the fact that early guesses do not achieve the same performance as the full input. We have shown empirically that it is in principle possible to maintain or even improve the performance with early guessing and that this is a goal worth pursuing; elaborating methods to achieve this remains outside the scope of the current paper, and we leave it for future work. Another limitation concerns the generalization of our findings to standard and non-standard data, which still needs to be better understood.
Each input shortening criterion is evaluated while considering different starting lengths m. For GDI we found the optimal m=4, for ILI m=7, and m=8 and m=2 for ADI-VarDial and ADI-AOC, respectively. The results for each input shortening criterion are reported in Tables 3, 4, 5 and 6. All input shortening criteria are evaluated separately, and some of the potential criteria are evaluated in combination.
In ADI-VarDial and ILI, we observed that the best heuristic was p4 & l1, while in GDI it was p4. In ADI-AOC, the addition of p1 to p4 improves the performance by 0.21 points. The behavior was the same across the different models. Generalizing, we found that performance with p4 is good across all the languages, which means the main criterion is predicted probability(current) > predicted probability(next); in other words, we stop the incremental classification once the model probability starts decreasing. In ILI and ADI-VarDial, the addition of the l1 criterion, predicted label(current) == predicted label(next), improves the performance by 0.39 and 0.87 points, respectively. Overall, we observed that it is possible to generalize input shortening criteria across different language dialects. The slight variations could be due to the nature of the datasets and inherent differences between the languages. For instance, the ILI and ADI-AOC datasets comprise standard written texts, while GDI and ADI-VarDial were transcribed from speech, and the latter was also transliterated. In general, we observed that the model's performance has a direct influence on the results of the input shortening criteria. Finally, we observed that for GDI the mean length of correctly predicted shortened input is 4.9, compared to a full-length average of 9.5. In ILI it is 9.2, compared to a full-length average of 18.5. In ADI-VarDial the mean length is 9.02 (compared to 43.02 for full input), and 3.58 (compared to 19.89) in ADI-AOC.

Figure 1 :
Figure 1: Proportion of correct predictions at each fragment length (%) for the GDI, ILI and ADI datasets. Solid lines represent the proportion relative to the (constant) test set size. Dashed lines show the relative proportion of correct predictions among all test instances of the given length.

Table 2 :
The accuracy (%) with different pretrained models on full utterances and on shortened input.

Table 3 :
Input shortening results on GDI with the best model (BERT-base-cased). M = number of fragments that satisfy the criterion.

Table 4 :
Input shortening results on ILI with the best model (IndicTransformers). M = number of fragments that satisfy the criterion.

Table 5 :
Input shortening results on ADI-VarDial with the best model (BERT-base-cased). M = number of fragments that satisfy the criterion.

Table 6 :
Input shortening results on ADI-AOC with the best model (AraBERT). M = number of fragments that satisfy the criterion.