A Parallel Corpus for Vietnamese Central-Northern Dialect Text Transfer



Introduction
Owing to the rapid development of natural language processing in the past few years, research on Vietnamese NLP has also benefited in terms of resources and modelling techniques. In particular, many task-specific Vietnamese datasets and dedicated language models have been released to the community (Dao et al., 2022a, Nguyen et al., 2018, Nguyen et al., 2022, Nguyen et al., 2020b, Nguyen and Nguyen, 2021, Tran et al., 2022a, Nguyen and Nguyen, 2020). While these benchmarks and models facilitate research on Vietnamese computational linguistics, they are limited in that they solely focus on standard Vietnamese text. Geographically, Vietnamese provinces are categorized into three macro-regions: the Northern, Central and Southern regions, bringing forth different dialects (Hiệp, 2009). Among these, the northern dialect is often treated as the standard, i.e. the de facto text style of the language (Phạm and Mcleod, 2016). Compared to the northern, the southern dialect differs in terms of pronunciation but is mostly similar in terms of lexicon, making the two mutually intelligible (Shimizu and Masaaki, 2021, Son, 2018). In contrast, the central dialect, besides its phonological deviations, possesses a significant number of vocabulary differences and is thus less mutually intelligible to the remaining two (Michaud et al., 2015, Pham, 2019). In fact, the central dialect is often perceived as "peculiar" or even "difficult to understand" by speakers of other regions due to the existence of various unique words (Hiệp, 2009, Pham, 2019). With a population comprising nearly one-third of the country, the central region, along with its dialect, remains an important and idiosyncratic part of the nation's culture (Handong et al., 2020, Phạm and Mcleod, 2016, Pham, 2005).

However, existing research, despite massive advancement in recent years, has mainly focused on the standard dialect so far (Truong et al., 2021, Nguyen et al., 2020a, Nguyen et al., 2017, Nguyen et al., 2020c, Lam et al., 2020), neglecting other variants with potentially rich linguistic value. Furthermore, state-of-the-art industry-level NLP systems, regardless of being production-ready, do not possess an understanding of the non-standard dialect. An illustrative example is presented in Figure 1, where a text utterance in the central dialect is input into several renowned translation systems, including Google Translate (https://translate.google.com/) and Yandex Translate (https://translate.yandex.com/). Here the central-style utterance differs from the northern variety at two words, răng and ngá, which correspond to the words sao and ngứa in the northern counterpart. The word răng means why/how in the central dialect, but it also means teeth in the northern (standard) dialect. In contrast, the word ngá means itchy and is lexically unique to the central dialect. Taken together, the input text can be translated as For some reasons I start to feel really itchy or simply Oh I feel so itchy. We can observe that while the translation outputs of the northern-style utterance are highly relevant, the predicted outputs for the central-style utterance are inapposite. In particular, all systems seem to mistake the meaning of the word răng as teeth and do not properly grasp the meaning of the word ngá. Despite the slight lexical difference between the two dialects, no system manages to correctly translate the central-style utterance, or even come close, producing completely unconnected content. This phenomenon can be considered a well-known bias in current NLP research, where developed models only function for a particular, majority group of users, ignoring the needs of other minor communities, e.g. translation for high-resource languages versus low-resource ones (Lignos et al., 2019, ter Hoeve et al., 2022, Ghosh and Caliskan, 2023). In the case of the Vietnamese language, it manifests as a dialectal bias where the non-prestige (i.e. central) dialect is not comprehended, even by well-tailored systems, a conceivable issue stemming from the lack of appropriate training data.
To remedy this dilemma, we introduce a new parallel corpus encompassing the central and northern dialects. Constructed through manual effort under strict quality control, the corpus is specifically designed for the dialect transfer task, with meaningful applications towards facilitating the development of more well-rounded NLP models that are not only potent on standard text but can also handle central-style wordings. We extensively evaluate several monolingual and multilingual language models on their ability to shift the dialect of an input utterance, which requires deep linguistic understanding of the Vietnamese language. In addition, we provide experiments and discussions on the corpus's applications beyond mere dialect transfer, that is, adapting prevailing NLP models to the central dialect domain without the need to re-train.
Our contributions can be summarized as follows:

• We introduce a new parallel corpus for central-northern dialect text transfer. We extensively benchmark the capacities of several monolingual and multilingual language models on the task, as well as their ability to discern between the two dialects.
• We find that for the dialect transfer task, the monolingual models consistently and significantly outperform the multilingual models.
In addition, we observe that all evaluated generative models suffer from a mismatch between fine-tuning and pre-training.
• We show that the competencies of existing models on downstream tasks, including translation and text-image retrieval, degrade when confronting central-style expressions. We further demonstrate that, by fine-tuning dialect transfer adapters, the efficacy of these models in the central dialect domain can be tremendously improved without the need to re-train.

Dataset Construction
In this section, we describe the procedure to construct a parallel corpus for Vietnamese central-northern dialect text transfer.

Procedure
Since there are subtle deviations among provinces in the same region, to improve annotation consistency, we recruited central annotators whose hometowns are located in Ha Tinh, a representative province of the central region, and northern annotators who were born and grew up in Ha Noi, the country's capital and also a major metropolitan area of the northern region (Nguyen et al., 2006, Tran et al., 2022b). We also required each annotator to be familiar with the other dialect (e.g. through work, living experience, etc.). Annotators first underwent a one-week training provided by a linguistics expert to assimilate the textual distinctions between the central and northern dialects. It is important that the annotators clearly grasp the dialectal differences in text style. Before commencing the construction process, the annotators had to pass an eligibility test in which they were tasked with annotating 10 prototype samples. Annotators whose annotation validity did not exceed 80% were re-trained and had to re-take the test until qualified. Eventually, 6 central annotators and 6 northern ones were recruited. We next describe the two steps employed to construct the parallel corpus. The first step only involves the central annotators while the second requires the participation of all members.
Step 1 - Central-style Corpus Creation. There is no publicly available content written solely in the central dialect text style. In order to construct a parallel corpus, we first need a collection of raw central-style text utterances. To this end, we divided the central annotators into groups of two. Each group was asked to act out pre-designed conversation scenarios in which they chat with each other using the central dialect. These scenarios were manually preset, ensuring that the conversations covered diverse situations (e.g. friends casually chatting about romantic stories, workers planning an afterparty, etc.), adding up to 112 conversations in total. After every three rounds, we randomly swapped members between groups to maintain fresh perspectives. Upon completion, we asked every central annotator to pick out messages that are central-style specific. A message is considered central-style specific if it contains words whose meanings, or lexical appearances, are unique to the central dialect. Since the conversations are natural, not every message is central-style specific, as the central dialect also shares certain lexical similarities with its northern counterpart. We observed that every central-style specific message picked out received at least 4/6 votes from the annotators, indicating high-level uniformity. We selected all messages with full votes and held discussion sessions among annotators to resolve the messages with partial votes, in which the linguistics specialist also participated. Ultimately, we obtained a set of 3761 central-style specific text utterances.
Step 2 - Dialect Conversion. For the obtained central-style corpus, we need to construct matching northern-style utterances. For this purpose, each central annotator was first paired with a distinct northern annotator. We then divided the raw samples into 10 folds and evenly distributed them to each pair of annotators. To annotate a sample, the central annotator first highlighted the dialect-specific words present in the utterance and conveyed their meanings, as well as the utterance's, to the northern annotator, who then had to produce an equivalent utterance in the northern style that fluently conveys the same content and is as close in nuance as possible. Following this step, we acquired a compilation of 3761 parallel central-northern utterance pairs. We present the corpus's base statistics in Table 1.

The corpus was originally constructed in a syllable-separated manner, which is the natural appearance of Vietnamese text (Nguyen et al., 2018). However, in Vietnamese, space is also used to separate syllables of the same word (Dinh et al., 2008). For example, the text utterance "Tôi là nghiên cứu sinh" comprises 5 separate syllables ["Tôi", "là", "nghiên", "cứu", "sinh"] that compose 3 words ["Tôi", "là", "nghiên cứu sinh"]. A majority of works employ the RDRsegmenter software (Nguyen et al., 2018) for automatic word segmentation (Dao et al., 2022a, Dao et al., 2022b, Nguyen and Nguyen, 2021, Truong et al., 2021, Nguyen et al., 2020a, Nguyen et al., 2017). As the tool was trained on standard text, we first investigated its reliability in segmenting central-style variants. For this purpose, we executed the tool on the central-style sequences and randomly selected a subset of 100 samples in which at least 2 word-merging operations were performed. Manually evaluating the tool's outputs, we found that it achieved 96.25% precision on this particular subdivision. However, the correctly segmented words were mostly standard words. Since we had annotated boundaries for central-style words, we next applied the software to all central-style sequences and calculated its recall rate as well as its error rate with regard to central-style words. We found the recall rate to be extremely low (3.04%), which validated our hypothesis that the segmenter is not aware of central-style (non-standard) words and necessitated manual segmentation of these specific words (which we carried out). In contrast, the error rate was fairly small (1.81%), which, together with the high precision measured earlier, indicated that the software segments conservatively, i.e. it avoids words it does not know. These preliminary inspections showed that RDRsegmenter's predictions are reasonably reliable on standard words but its effectiveness on central-style words is negligible. We then executed the tool to obtain automatic boundaries for the standard words, resulting in the word-segmented version of the corpus.

We also report the percentage of novel n-grams in the central-style samples with respect to their northern counterparts. At the unigram level, the two dialects exhibit a nearly 50% lexical distinction, bespeaking the uniqueness of the central dialect's vocabulary.
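To make the segmentation scoring above concrete, the following is a minimal sketch, not the authors' actual scoring script: the precise definition of the error rate is an assumption, and both the gold and predicted segmentations are assumed to follow the common convention of joining a word's syllables with underscores.

```python
# Minimal sketch (assumed, not the paper's exact script) for scoring a word segmenter
# against manually annotated central-style words. Convention assumed: syllables of a
# word are joined with underscores, e.g. "Tôi là nghiên_cứu_sinh".

def merged_words(segmented: str) -> set[str]:
    """Return the set of multi-syllable words (tokens containing '_')."""
    return {tok for tok in segmented.split() if "_" in tok}

def recall_and_error(gold_central: set[str], predicted: str) -> tuple[float, float]:
    """Recall: fraction of annotated central-style words the tool merged exactly.
    Error rate (assumed definition): fraction of the tool's merges that touch a
    central-style word's syllables without matching an annotated word."""
    pred = merged_words(predicted)
    recall = len(gold_central & pred) / len(gold_central) if gold_central else 0.0
    gold_syllables = {s for w in gold_central for s in w.split("_")}
    wrong = [w for w in pred - gold_central if any(s in gold_syllables for s in w.split("_"))]
    error = len(wrong) / len(pred) if pred else 0.0
    return recall, error

# Toy usage: the annotated central-style word "lấy_gấy" is left unmerged by the tool.
print(recall_and_error({"lấy_gấy"}, "tuần sau tui lấy gấy rồi"))  # -> (0.0, 0.0)
```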

Quality Control
To validate the dataset's quality, we randomly designated 100 pairs of samples and requested each annotator duo (central & northern) to rate their agreement with the conversion on a scale from 1 to 5. We further asked each central participant to appraise the annotation of central-style words and provide consensus scores on a similar scale. We report the Fleiss' Kappa scores (Fleiss, 1971) along with the overall agreement in Table 3. Compared to the conversion task, we observed that the agreement on central-style word labels was slightly lower. We later held a meeting with the annotators to investigate the cause and found that it emanated from ambiguities in determining word boundaries. Nevertheless, the statistics signify substantial inter-annotator agreement in each task.
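For reference, the agreement statistic can be reproduced from raw rating counts with a generic implementation of Fleiss' Kappa (Fleiss, 1971); the sketch below is illustrative rather than the authors' evaluation script, and the toy counts are made up.

```python
# Minimal sketch of Fleiss' Kappa. ratings[i][j] = number of raters who assigned
# item i to category j (here, the five levels of the 1-5 agreement scale).

def fleiss_kappa(ratings: list[list[int]]) -> float:
    n_items, n_cats = len(ratings), len(ratings[0])
    n_raters = sum(ratings[0])                      # raters per item (assumed constant)
    # chance agreement from per-category proportions
    p_j = [sum(row[j] for row in ratings) / (n_items * n_raters) for j in range(n_cats)]
    p_e = sum(p * p for p in p_j)
    # observed per-item agreement
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1)) for row in ratings]
    p_bar = sum(p_i) / n_items
    return (p_bar - p_e) / (1 - p_e)

# Toy usage: 3 items, 2 raters, ratings on a 1-5 scale.
counts = [
    [0, 0, 0, 1, 1],   # raters chose 4 and 5
    [0, 0, 0, 0, 2],   # both chose 5
    [0, 0, 0, 2, 0],   # both chose 4
]
print(round(fleiss_kappa(counts), 4))  # 0.3333
```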
Task Formulation

Dialect Text Transfer. Given a text utterance x provided in the style of dialect a, we need to convert it to a target dialect b while preserving the meaning of x.
Formally, denoting $x = [x_0, x_1, \ldots, x_n]$ as the sequence of input tokens and $y = [y_0, y_1, \ldots, y_m]$ as the desired output tokens, we would like to model the conditional distribution $P(y|x)$. Training involves minimizing the negative log-likelihood $\mathcal{L} = -\sum_{t=0}^{m} \log P_\theta(y_t \mid y_{<t}, x)$, where $\theta$ represents the model's parameters.
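This objective is the standard token-level cross-entropy computed by sequence-to-sequence toolkits. The sketch below is illustrative only: it assumes a Hugging Face checkpoint such as vinai/bartpho-syllable and toy sentences paraphrasing the Figure 1 example, not the exact training code.

```python
# Minimal sketch: the loss returned by a seq2seq model given labels is the
# token-averaged negative log-likelihood -1/m * sum_t log P(y_t | y_<t, x).
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

name = "vinai/bartpho-syllable"                       # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)

central = "răng tự nhiên tui thấy ngá quá"            # source dialect x (toy example)
northern = "sao tự nhiên tôi thấy ngứa quá"           # target dialect y (toy example)

enc = tokenizer(central, return_tensors="pt")
labels = tokenizer(text_target=northern, return_tensors="pt").input_ids

with torch.no_grad():
    out = model(**enc, labels=labels)
print(out.loss)                                       # negative log-likelihood to minimize
```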
In this work, we consider two settings: central-to-northern and northern-to-central. Tackling the former means that we can readily adapt existing Vietnamese NLP models to handle the central dialect domain, whereas the latter aids in synthesizing central-style data from existing standard corpora. Both directions have widespread applications and can facilitate building intelligent agents with more inclusive comprehension and capabilities.
During training, we used a batch size of 32 and a learning rate of 1e-6 with the AdamW optimizer (Loshchilov and Hutter, 2017) and a linear decay schedule. All models were fine-tuned for a maximum of 300 epochs with early stopping. All settings were implemented with the PyTorch (Paszke et al., 2019) framework and the Transformers (Wolf et al., 2019) library. For each model, the top-5 checkpoints with the lowest validation losses were selected and evaluated on the test set. We used greedy decoding in all experiments unless explicitly mentioned otherwise. We also considered SBERT-based (Reimers and Gurevych, 2019) retrieval baselines as lower bounds. In particular, we adopted two multilingual SBERT models (Reimers and Gurevych, 2020), denoted as Retrieval-M1 and Retrieval-M2, along with a publicly available Vietnamese SBERT model, hereafter abbreviated as Retrieval-Vi. In each direction, we first encoded the input utterance and then retrieved the sample with the closest semantic distance in the target dialect from the training set. For details on pre-trained checkpoints, please see Appendix A.
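For illustration, the retrieval baseline can be sketched as follows; the checkpoint name is a placeholder for the SBERT models listed in Appendix A, and the toy list stands in for the target-dialect side of the training set.

```python
# Minimal sketch of the SBERT retrieval baseline: embed the input utterance and
# return the training sample of the target dialect with the highest cosine similarity.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")  # placeholder checkpoint

train_targets = [                                  # northern-style side of the training set (toy)
    "sao tự nhiên tôi thấy ngứa quá",
    "tuần sau tôi cưới vợ rồi",
]
target_emb = encoder.encode(train_targets, convert_to_tensor=True)

def retrieve(central_utterance: str) -> str:
    query_emb = encoder.encode(central_utterance, convert_to_tensor=True)
    best = util.cos_sim(query_emb, target_emb).argmax().item()
    return train_targets[best]

print(retrieve("răng tự nhiên tui thấy ngá quá"))
```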
To evaluate the quality of predicted sequences, we adopted a set of automatic evaluation metrics: ROUGE (Lin, 2004), BLEU (Papineni et al., 2002) and METEOR (Banerjee and Lavie, 2005). For the predictions of the generative models, we also conducted human evaluations in which each participant was presented with the gold sequence and the predicted outputs of 5 systems, conditioned on 100 random test samples. The predicted outputs were shuffled and the participants were not aware of which model produced which output. Each participant then picked out the sequence that he/she deemed the most suitable conversion. For each direction, we employed 3 participants with appropriate backgrounds (e.g. central-origin participants for the northern-to-central transition and vice versa). After voting, we further held a meeting to resolve conflicts among raters, in which a linguistics specialist also took part. For the two directions, we obtained corresponding Fleiss' Kappa scores (Fleiss, 1971) of 0.6447 and 0.6284, which implies substantial agreement.
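A minimal sketch of the automatic scoring, assuming Hugging Face's evaluate package (with sacreBLEU standing in for BLEU), is shown below; it is not the exact evaluation script used in our experiments.

```python
# Minimal sketch: score predicted conversions against gold target-dialect references.
import evaluate

rouge = evaluate.load("rouge")
bleu = evaluate.load("sacrebleu")
meteor = evaluate.load("meteor")

predictions = ["sao tự nhiên tôi thấy ngứa quá"]          # model output (toy example)
references = ["sao tự nhiên tôi thấy ngứa thế"]           # gold conversion (toy example)

print(rouge.compute(predictions=predictions, references=references))
print(bleu.compute(predictions=predictions, references=[[r] for r in references]))
print(meteor.compute(predictions=predictions, references=references))
```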
Dialect Transfer. We present the results for the two transfer directions in Tables 4 and 5. For both settings, the retrieval baselines perform significantly worse than the generative (GEN) models, which is to be expected. Among both the retrieval and GEN models, the monolingual ones consistently and remarkably outperform their multilingual counterparts on every metric. In terms of human preference, the monolingual outputs are also chosen more frequently. For the northern-to-central direction, these gaps widen by a large margin. In particular, the BARTpho-syllable-base model outperforms the mBART model by 11 ROUGE-1 points and 22 BLEU points. The mBART model is also the least preferred by humans. This shows that the dialect transfer task requires deep understanding of the Vietnamese language, making the monolingual models a better fit and accordingly much stronger than the multilingual ones.
Mismatch between pre-training and fine-tuning. Traditionally, these generative models were pre-trained on standard Vietnamese text (Tran et al., 2022a, Liu et al., 2020); conditioning them to produce Vietnamese text of the central dialect (which is non-standard) might thus induce an undesirable mismatch. Given an evaluation metric $P$, denote $P_{CN}$ (central-to-northern) and $P_{NC}$ (northern-to-central) as the model's performance in each direction. We define $\delta_P = P_{CN} - P_{NC}$ as the performance difference when the model is conditioned to generate northern-style text versus central-style text. To scrutinize the mentioned phenomenon, we illustrate $\delta_{\text{ROUGE-1}}$ in Figure 2. It can be seen that performance drops manifest in every model type across different decoding beam sizes. The mBART model has the largest drop of roughly 8 points, whereas each monolingual model exhibits a drop of around 1-2 points. This again betokens the inferiority of the multilingual model in the dialect transfer task.

Zero-shot and Few-shot settings. Large language models (LLMs) have been shown to possess emergent abilities that allow them to seamlessly generalize to unseen tasks (Brown et al., 2020). As a representative trial, we conducted experiments on ChatGPT (Ouyang et al., 2022, Ghosh and Caliskan, 2023), a multilingual agent powered by state-of-the-art LLMs, in both the zero-shot and few-shot (5 exemplars) settings on the central-to-northern dialect transfer task. Since the chatbot itself possesses high expressiveness, we further asked it to provide explanations for the conducted operations. For each explanation point provided by the chatbot, we manually evaluated whether it was valid (i.e. the explanation is linguistically correct, and the inferred operation also adds up). At the sequence level, we evaluated three aspects: fluency, style and correctness (the predicted sequence must preserve the utterance's meaning and be fluent in the target dialect). We randomly selected 100 samples from the test set and also included the outputs of the BARTpho-syllable-base model as an upper bound (BARTpho for short). System outputs were anonymized and shuffled, and each was examined by 3 northern raters. As presented in Table 6, although ChatGPT could produce moderately stylized text, fewer than half of its outputs were fluent and the correctness ratio was below 10%. The chatbot was capable of providing relatively well-grounded interpretation points with a nearly 30% valid ratio (Table 7), but its reliability still falls behind the fully supervised BARTpho model, which achieved over 80% correctness. We further observed that the few-shot setting helped improve the chatbot's performance, validating our initial hypothesis that the performance bottleneck in handling central-style inputs can be remedied with the provision of in-domain data. Nevertheless, given the observed overwhelming gap, multilingual large language models such as ChatGPT are far from replacing task-specific monolingual models such as BARTpho.

Adapting existing NLP models to the central-style text is a worth-exploring endeavor. We first conduct experiments on the Vietnamese-English translation task with Google Translate as the base translation model (responses collected between 28/05/2023 and 07/06/2023). We consider three types of input: the central-style utterance, the ground-truth northern-style utterance, and the utterance predicted by a BARTpho-syllable-base model. A total of 100 random samples were drawn from the test set, each rated by three participants (all with 7.0+ IELTS proficiency) in terms of content and fluency on a 1-5 scale. The outputs were shuffled, and no participant knew which system produced which output. In the event that two or more outputs were identical, we conducted de-duplication to avoid biases. We report the average scores in Table 8. It is visible that the translation model degenerates substantially when confronting central-style inputs compared to the gold northern-style ones. In contrast, when these central-style inputs were transferred to the northern dialect (BARTpho), the translation quality significantly improved and nearly matched that of the ground-truth standards. These findings show that even though a model's performance degrades when the input text belongs to the central (non-standard) dialect, we can construct adapters enabling existing NLP models to readily cope with central-style inputs at high precision. We further present qualitative examples in Figure 3 (the northern utterances there were transferred by the BARTpho-syllable-base model).
As an alternative application, we experimented with the text-image retrieval task. Specifically, we employ a multilingual CLIP model (Radford et al., 2021, Reimers and Gurevych, 2020) to retrieve images related to a Vietnamese text query on the COCO dataset (Lin et al., 2014). We show a qualitative example in Figure 4, where the model receives input text both in the central dialect and in its northern counterpart (transferred by the BARTpho-syllable-base model). The lower two rows depict images retrieved with the central-style query, whereas the upper two rows contain those obtained with the converted standard query. We can see that the dialect of the input query largely affects the relevance of the retrieved images, especially when the query's key points stem from the unique lexicon of the central (non-standard) dialect. In Figure 4, the central-style query contains the unique phrase lấy gấy (get married), which the CLIP model did not understand and thus retrieved irrelevant images (lower two rows). When processing the northern-style query, the model captured highly relevant outputs (the upper two rows are all marriage-related images).
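As an illustration of this setup, the sketch below pairs a CLIP image encoder with a multilingual text encoder through sentence-transformers; the checkpoint names and image files are assumptions and not necessarily the exact configuration used in our experiments.

```python
# Minimal sketch of multilingual text-image retrieval: images are embedded with a CLIP
# vision encoder, the Vietnamese query with a multilingual text encoder aligned to it.
from PIL import Image
from sentence_transformers import SentenceTransformer, util

img_model = SentenceTransformer("clip-ViT-B-32")                                         # assumed
txt_model = SentenceTransformer("sentence-transformers/clip-ViT-B-32-multilingual-v1")   # assumed

image_paths = ["coco_0001.jpg", "coco_0002.jpg"]        # placeholder COCO images
img_emb = img_model.encode([Image.open(p) for p in image_paths], convert_to_tensor=True)

query = "hai người đang làm đám cưới"                   # "two people getting married"
txt_emb = txt_model.encode(query, convert_to_tensor=True)

scores = util.cos_sim(txt_emb, img_emb)[0]
ranked = sorted(zip(image_paths, scores.tolist()), key=lambda x: -x[1])
print(ranked)                                           # most relevant image first
```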

How discriminative is the central dialect?
To study how anomalous the central dialect is at both the word and sequence level, we fine-tuned a set of auto-encoding language models on two tasks: central-style word extraction and dialect detection. The former can be treated as a sequence labelling task whereas the latter resembles a sequence classification task. Note that the word extraction task is non-trivial, as central-style words can have the same lexical appearance as certain northern-style words, in which case the models have to disambiguate them through contextual information. For example, the word răng means how/why in the central dialect but also denotes teeth in the northern dialect.
In our experiments, we employed the multilingual XLM-RoBERTa models (Conneau et al., 2019) and the monolingual PhoBERT models (Nguyen and Nguyen, 2020) in their large and base variants (PhoBERT-Base-V2 uses more pre-training data, but its architecture is the same as PhoBERT-Base). For every model, we picked the top-5 checkpoints with the best validation performance (accuracy or F1) and accordingly reported the mean and standard deviation. Training hyperparameters (learning rate, batch size, etc.) remain the same as in previous experiments.

At the sequence level (Table 9), it is quite easy to distinguish between the two dialects, with fine-tuned models achieving near-perfect accuracy. In contrast, at the word level (Table 10), the central-style attributes are seemingly harder to catch. Nonetheless, we find that the detection (and extraction) performance is decent in general, as the accuracy (or F1 score) of every model is well above 95%. We also observe that the monolingual PhoBERT models typically perform better than the XLM models, with the exception of the dialect detection task, where the XLM-R-Large model outperforms the PhoBERT models on the test set (albeit by a small margin). On the central-style word extraction task, the PhoBERT models consistently outperform the XLM-R models on both the test and validation splits at the same architecture scale. We hypothesize that the dialect detection task only requires classifying the utterance's dialect at the sequence level, which can be relatively easy, whereas the word extraction task requires more extensive linguistic knowledge, including locating word spans and disambiguating central-style words from standard ones, at which the monolingual models are better.
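For concreteness, the extraction task can be framed as token classification, as in the minimal sketch below; the BIO label scheme, checkpoint and toy input are assumptions for illustration, and the fine-tuning loop on the annotated spans is omitted.

```python
# Minimal sketch: central-style word extraction as BIO token classification with PhoBERT.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-CENTRAL", "I-CENTRAL"]                # assumed label scheme
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")
model = AutoModelForTokenClassification.from_pretrained("vinai/phobert-base",
                                                        num_labels=len(labels))
# PhoBERT expects word-segmented input; the classification head here is untrained
# and would be fine-tuned on the corpus's annotated central-style word spans.
sentence = "răng tự_nhiên tui thấy ngá quá"
enc = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits                        # (1, seq_len, num_labels)
pred = [labels[i] for i in logits.argmax(-1)[0].tolist()]
print(list(zip(tokenizer.convert_ids_to_tokens(enc.input_ids[0]), pred)))
```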

Related Works
Research on language varieties, specifically dialects, has been actively developed for many languages, including those with Latin (Demszky et al., 2020, Kuparinen, 2023, Samardžić et al., 2016, Kuparinen and Scherrer, 2023, Miletic and Scherrer, 2022, Dereza et al., 2023, Ramponi and Casula, 2023, Vaidya and Kane, 2023, Aji et al., 2022) and non-Latin writing systems (Zaidan and Callison-Burch, 2011, Bouamor et al., 2018, Li et al., 2023). The research areas mostly revolve around dialect classification (Kanjirangat et al., 2023, Kuzman et al., 2023) and normalization (Kuparinen and Scherrer, 2023, Hämäläinen et al., 2022), but also span related downstream tasks such as hate speech detection (Castillo-López et al., 2023), sentiment analysis (Srivastava and Chiang, 2023), part-of-speech tagging (Maehlum et al., 2022) and eye-tracking (Li et al., 2023). For the Vietnamese language, dialect-related research remains confined to traditional linguistics studies (Pham, 2019, Hiệp, 2009, Tsukada and Nguyen, 2008, Son, 2018), whereas computational linguistics mainly focuses on the standard dialect (Nguyen et al., 2020b, Nguyen et al., 2020c, Lam et al., 2020). Among the small number of works considering dialectal differences, only signal processing and speech-related tasks are explored (Hung et al., 2016a, Hung et al., 2016b, Schweitzer and Vu, 2016), centering on the tonal (phonetic) deviations between dialects, while the textual attributes and their effects on downstream tasks remain unexplored. In recent years, a plethora of Vietnamese NLP datasets have been released to facilitate the development of downstream tasks, including intent detection and slot filling (Dao et al., 2022b), speech translation (Nguyen et al., 2022), machine translation (Doan et al., 2021), named entity recognition (Truong et al., 2021) and text-to-SQL (Nguyen et al., 2020a). These corpora share the limitation that they do not take the (non-standard) central dialect into account and only target standard text. While they accelerate progress on a number of tasks, systems trained on these datasets potentially carry the same drawback of not being apt to confront central-style wordings, which might in turn cause the existing unfairness to become more severe.
As growing needs for text style transfer emerge, the field has been actively receiving attention from the research community (Jin et al., 2020). The settings vastly differ on a situational basis, ranging from business use cases such as formality (Briakou et al., 2021), politeness (Madaan et al., 2020), authorship (Carlson et al., 2017) and simplicity (den Bercken et al., 2019) to scenarios that aim at mitigating social issues such as toxicity (dos Santos et al., 2018), sarcasm (Tay et al., 2018b), gender (Prabhumoye et al., 2018), sentiment (He and McAuley, 2016; Tay et al., 2018b,a,c) and biases (Voigt et al., 2018). Among them, many are developed with extended applications in mind, i.e. facilitating the progress of other tasks such as paraphrasing (Yamshchikov et al., 2020), summarization (Bertsch et al., 2022) or producing style-specific translation (Wu et al., 2020). In our work, we focus on a more language-oriented setting, that is, the transfer between different dialects (i.e. the central and northern dialects), with meaningful pertinence towards more inclusive NLP models.

Conclusion
In this paper, we tackle the dialect transfer problem for the Vietnamese language. In particular, we introduce a new benchmark for Vietnamese central-northern dialect text transfer. Through extensive experiments, we uncover the deficiencies of multilingual models on the task compared to their monolingual counterparts. We also highlight the performance bias of existing NLP systems regarding the Vietnamese central dialect. As a prospective remedy, we further demonstrate that fine-tuned transfer modules can empower existing models to address central-style wordings without the need for re-training.

Limitations
Although our work addresses practical problems specific to the Vietnamese language and the central dialect, the corpus was constructed with the participation of annotators from representative provinces only (i.e. Ha Tinh and Ha Noi). While this decision consolidates annotation consistency, the corpus only represents a major portion of the idiosyncratic features of the two dialects, and not their entirety. Therefore, closer inspection of the subtle deviations among other related provinces could provide more insightful looks into the characteristics of the two dialects. Additionally, although we put more focus on the applications of central-to-northern transfer in this work, as they aid in readily adapting existing NLP models to central-style text, the reverse direction is also appealing, as it can facilitate the synthesis of central-style task-specific data for Vietnamese NLP research. As the experiments demonstrate a decent level of performance for the fine-tuned models, leveraging them to synthesize and assemble targeted resources in the central dialect can be an impactful direction for future work.

C Attention Visualization
To better understand the generative process of the trained models, we visualize the last decoder layer's cross-attention maps of random test samples with BERTViz (Vig, 2019). Here we pick the central-to-northern transfer direction and choose the outputs of three models: BARTpho-syllable-base, BARTpho-word-base and mBART. In Figure 5, the token đồ (central-style) corresponds to the tokens các kiểu (northern-style). We can observe that the attention maps of the BARTpho models are more accurate, aligning with the target tokens, whereas the mBART model's attention maps are vaguer and not quite correct. In Figure 6, the token ngài (central-style) corresponds to the token người in the northern dialect. The attention maps of the three models are, however, a bit off, with the main focus on the preceding token bốn instead of the token người. Still, we can perceive that for the BARTpho models, more attention heads point to the token người than for the mBART model.
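While the figures here were produced with BERTViz, the same cross-attention tensors can be read directly from a Transformers checkpoint; the sketch below is a rough, assumed alternative (the checkpoint name and sentence pair are placeholders, and rows correspond to the internally right-shifted target tokens).

```python
# Minimal sketch: extract and plot the last decoder layer's cross-attention of a
# seq2seq model (averaged over attention heads).
import torch
import matplotlib.pyplot as plt
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

name = "vinai/bartpho-syllable"                         # placeholder for a fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)

src = "bữa ni ăn đồ rứa à"                              # central-style input (toy example)
tgt = "hôm nay ăn các kiểu như thế à"                   # northern-style output (toy example)
enc = tokenizer(src, return_tensors="pt")
labels = tokenizer(text_target=tgt, return_tensors="pt").input_ids

with torch.no_grad():
    out = model(**enc, labels=labels, output_attentions=True)

# cross_attentions: tuple over layers, each of shape (batch, heads, tgt_len, src_len)
attn = out.cross_attentions[-1][0].mean(0)
plt.imshow(attn, cmap="viridis")
plt.xticks(range(attn.shape[1]), tokenizer.convert_ids_to_tokens(enc.input_ids[0]), rotation=90)
plt.yticks(range(attn.shape[0]), tokenizer.convert_ids_to_tokens(labels[0]))
plt.savefig("cross_attention.png", bbox_inches="tight")
```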

D Prompts
For the zero-shot experiments with ChatGPT, we used the following template:

Convert the following Vietnamese central dialect text utterance into the northern dialect. Explain the difference and how you do it.
Central Text: {Central-style test input}
Northern Text:

For the few-shot experiments, we randomly sampled 5 exemplars from the training set for each prompt and used a similar template.

Figure 1: Industry-level translation systems respond differently regarding the central dialect.


Figure 5: Visualization of attention maps ((a) BARTpho-syllable-base, (b) BARTpho-word-base, (c) mBART) - transferring from the central (right) to the northern (left) dialect (1).

Figure 6: Visualization of attention maps - transferring from the central (right) to the northern (left) dialect (2).

Table 5: Results for northern-to-central dialect transfer (test set). We highlight the best results in each column and underline the second-best.

Table 7: Human analysis of ChatGPT's interpretation points for central-to-northern dialect transfer.

Table 8: Effects of dialect style on Vietnamese-English translation for Google Translate.

Table 9: Results on dialect detection.

Table 10: Results on central-style word extraction.

Table 11: List of pre-trained models and their corresponding namespaces on Hugging Face.

Table 13: Effects of different beam sizes (northern-to-central).