Cross-lingual Data Augmentation for Document-grounded Dialog Systems in Low Resource Languages

This paper proposes a framework to address the issue of data scarcity in Document-Grounded Dialogue Systems (DGDS). Our model leverages high-resource languages to enhance the capability of dialogue generation in low-resource languages. Specifically, we present a novel pipeline, CLEM (Cross-Lingual Enhanced Model), comprising adversarially trained retrieval components (Retriever and Re-ranker) and a FiD (fusion-in-decoder) generator. To further leverage high-resource languages, we also propose an innovative architecture that aligns different languages through translated training. Extensive experimental results demonstrate the effectiveness of our model, which achieved 4th place in the DialDoc 2023 Competition. CLEM can therefore serve as a solution to resource scarcity in DGDS and provide useful guidance for multi-lingual alignment tasks.


Introduction
The Document-Grounded Dialogue System (DGDS) is a meaningful yet challenging task: it not only makes content accessible to end users via various conversational interfaces, but also requires generating faithful responses grounded in knowledge resources.
However, in real-world scenarios, we may not have abundant resources to construct an effective dialogue system, owing to the scarcity of resources for minority languages such as Vietnamese and French. Previous works only consider building DGDS in high-resource languages with rich document resources, such as English and Chinese (Feng et al., 2021; Fu et al., 2022), which does not reflect real-world situations. Many minority languages struggle to support well-founded chatbots due to the scarcity of documents.
Therefore, how to generate evidential responses under a scarce-resource setting deserves attention. To address this issue, we propose a novel architecture that leverages high-resource languages to supplement low-resource languages and, in turn, builds a fact-based dialogue system. Thus, our model can not only handle high-resource scenarios but also generate faithful responses under low-resource settings. Our key contributions can be split into three parts:
• We propose a novel framework, dubbed CLEM, comprising an adversarially trained Retriever, a Re-ranker, and a FiD (fusion-in-decoder) generator.
• We present a novel architecture combining translated training and three-stage training.
• Extensive results demonstrate the effectiveness of CLEM. Our team won 4th place in the Third DialDoc Shared Task competition.

Related Work
A Document-Grounded Dialogue System is an advanced dialogue system that must search relevant external knowledge sources in order to generate coherent and informative responses. To evaluate and benchmark the performance of such systems, existing DGDS datasets can be broadly classified into three categories based on their objectives: 1) Chitchat, such as WoW (Dinan et al., 2019), Holl-E (Moghe et al., 2018), and CMU-DoG (Zhou et al., 2018), which typically involve casual, open-ended conversations on various topics; 2) Conversational Reading Comprehension (CRC), which requires the agent to answer questions based on its understanding of a given text passage, with examples including CoQA (Reddy et al., 2019), Abg-CoQA (Guo et al., 2021), and ShARC (Saeidi et al., 2018); and 3) Information-seeking scenarios, such as Doc2dial (Feng et al., 2020), Multidoc2dial (Feng et al., 2021), and Doc2bot (Fu et al., 2022), where the agent needs to retrieve relevant information from one or more documents to address a user's query.

Cross-lingual Data Augmentation

Data augmentation (DA) has emerged as an effective approach to address the challenges of multilingual NLP tasks (Zhang et al., 2019; Singh et al., 2019; Riabi et al., 2021; Qin et al., 2020; Bari et al., 2021). Particularly in low-resource language settings, DA has demonstrated its usefulness (Liu et al., 2021; Zhou et al., 2022b,a). Explicit DA techniques mainly involve translation-based templates, such as word-level adversarial learning (Bari et al., 2020) and designed translation templates (Liu et al., 2021; Zhou et al., 2022b). Implicit DA techniques, on the other hand, focus on modeling rather than expanding datasets, e.g., representation alignment (Mao et al., 2020), knowledge distillation (Chen et al., 2021), and transfer learning (Schuster et al., 2019).

Task Description
Formulation. We aim to improve the performance of DGDS in low-resource languages (Vietnamese and French). Formally, we are given a labeled set D = {(x_i, p_i, r_i)}, i = 1, ..., N_D, where N_D denotes the number of examples and x_i, p_i, r_i denote the input, grounding passage, and response, respectively. Note that the input is obtained by concatenating the current turn with the previous context. In addition, we have access to high-resource-language labeled datasets U of size N_U, where N_U ≫ N_D. Our goal is to explore how to utilize the high-resource datasets to enhance performance in the low-resource languages (Vietnamese and French).
We have access to two large datasets, namely Multidoc2dial (Feng et al., 2021) for English and Doc2bot (Fu et al., 2022) for Chinese. To fully exploit these high-resource datasets for improving performance in French and Vietnamese, we conducted translated training and generated pseudo-labeled training sets in Vietnamese and French. Specifically, we utilized the Baidu API and the Tencent API to translate English into French and Chinese into Vietnamese, respectively. Notably, English and French are Indo-European languages, indicating a common ancestral language, while Chinese and Vietnamese share historical and cultural connections and have influenced each other.

Methodology
We adopt the Retrieve-Rerank-Generate architecture (Glass et al., 2022; Zhang et al., 2023) and incorporate adversarial training into both the Retriever and Re-ranker components. To address the low-resource DGDS scenario, we propose a novel three-stage training approach.

Passage-Retriever With FGM
Given an input x, the retriever aims to retrieve the top-k most relevant passages {z_i} from a large candidate pool. We follow the schema of conventional Dense Passage Retrieval (DPR) (Karpukhin et al., 2020): queries and passages are encoded into dense embedding vectors and scored by their inner product. To further improve multi-lingual performance, the encoders are initialized from XLM-RoBERTa (Conneau et al., 2019), denoted XLM-R. Sub-linear time search can be achieved with Maximum Inner Product Search (MIPS) (Shrivastava and Li, 2014).
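As an illustration, the DPR-style scoring step can be sketched as follows. The embeddings here are toy vectors standing in for XLM-R encoder outputs, and `retrieve_top_k` is a hypothetical helper, not the paper's implementation (which would use a MIPS index rather than exhaustive scoring):

```python
def inner_product(u, v):
    """Relevance score between a query and a passage embedding."""
    return sum(a * b for a, b in zip(u, v))

def retrieve_top_k(query_emb, passage_embs, k):
    """Rank all passages by inner product with the query embedding
    and return the indices of the top-k highest-scoring ones."""
    ranked = sorted(
        range(len(passage_embs)),
        key=lambda i: inner_product(query_emb, passage_embs[i]),
        reverse=True,
    )
    return ranked[:k]

# Toy usage: passage 1 aligns best with the query, passage 2 second.
query = [1.0, 0.0]
passages = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
top = retrieve_top_k(query, passages, k=2)
```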
In addition, inspired by FGM (Miyato et al., 2017), we extend adversarial training to document retrieval. We apply infinitesimal perturbations to the word embeddings, constructing adversarial examples that increase the learning difficulty. The passage retriever is thereby regularized and generalizes better, since it must retrieve the correct relevant documents under the attack of adversarial examples.
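A minimal sketch of the FGM perturbation rule, assuming the standard formulation r = ε · g / ‖g‖₂ applied to the embedding gradient g (a plain-Python stand-in, not the paper's training code):

```python
import math

def fgm_perturbation(grad, epsilon=1.0):
    """FGM-style adversarial perturbation: scale the embedding
    gradient to have L2 norm epsilon, pointing in the direction
    that maximally increases the loss."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0] * len(grad)
    return [epsilon * g / norm for g in grad]

# The perturbation is added to the word embedding before a second
# forward pass; the model is trained on both clean and perturbed inputs.
r = fgm_perturbation([3.0, 4.0], epsilon=1.0)
```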

Passage-Reranker with FGM
Given a shortlist of candidates, the goal of the Re-ranker is to capture deeper interactions between a query x and a candidate passage p. Specifically, the query x and passage p are concatenated to form the input to XLM-RoBERTa (Conneau et al., 2019), and the pooler output of XLM-RoBERTa is used as the similarity score. As in the previous stage, we employ FGM (Miyato et al., 2017) to add perturbations to the word embeddings.
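The re-ranking step can be sketched as below. Here `overlap_score` is a toy stand-in for the XLM-R pooler score, used only to make the example self-contained; in the actual system the scoring function is the cross-encoder over the concatenated query and passage:

```python
def overlap_score(query, passage):
    """Toy stand-in for a cross-encoder score: token overlap count."""
    return len(set(query.split()) & set(passage.split()))

def rerank(query, passages, score_fn):
    """Score each (query, passage) pair jointly and return the
    passages sorted from most to least relevant."""
    scored = [(score_fn(query, p), p) for p in passages]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [p for _, p in scored]

# Toy usage: the passage sharing more tokens with the query ranks first.
candidates = ["reset your password here", "shipping information page"]
ranked = rerank("how to reset password", candidates, overlap_score)
```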

Knowledge-Enhancement Generation
The generator aims to produce correct and factual responses based on the candidate passages. The key problem is how to leverage the knowledge in the passage candidates as much as possible. We adopt Fusion-in-Decoder (FiD) (Izacard and Grave, 2021) as our response generator. During generation, FiD first encodes the input paired with each passage independently through the encoder, and then decodes over all encoded features jointly to generate the final response; concisely, the decoder has extra cross-attention over more passage features. This is significant because it is equivalent to improving grounding-passage accuracy from top-k to top-n. Note that k ≪ n due to the CUDA memory limitation.
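The FiD control flow (independent encoding, joint decoding) can be sketched abstractly. Here `encode` and `decode` are placeholder callables rather than the actual transformer components, so the sketch only illustrates the data flow:

```python
def fid_generate(query, passages, encode, decode):
    """FiD-style generation: encode each (query, passage) pair
    independently, concatenate all encoded features, and decode
    jointly over the fused sequence."""
    encoded = [encode(query, p) for p in passages]   # independent
    fused = [feat for enc in encoded for feat in enc]  # concatenate
    return decode(fused)                              # joint decoding

# Toy usage with string "features" to show the fusion step.
encode = lambda q, p: [f"{q}:{p}"]
decode = lambda feats: " | ".join(feats)
out = fid_generate("q", ["a", "b"], encode, decode)
```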
Since prompt-learning has proven effective for generation in previous work (Wei et al., 2021), we also adopt it by adding a prompt to the front of the input query. We choose "please generate the response:" as our prompt, so the final generator input is "prompt <query> query <passage> passage", where <query> and <passage> are special tokens.
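A minimal sketch of the input construction described above (the helper name is hypothetical; the layout follows the string given in the text):

```python
def build_generator_input(query, passage,
                          prompt="please generate the response:"):
    """Assemble the generator input: fixed prompt, then the query and
    passage marked with the <query> and <passage> special tokens."""
    return f"{prompt} <query> {query} <passage> {passage}"

# Toy usage:
inp = build_generator_input("how do I renew my visa?",
                            "Visa renewal requires form DS-160.")
```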

Training Process
Our training process consists of three stages. In the first stage, we use all available Chinese and English training corpora to pre-train the model, aiming to develop its primary cross-lingual perception capability. We also incorporate the downstream fine-tuning data in this stage. We denote this stage as T(D + D_t), where T represents training.
In the second stage, we train the model on the translated pseudo data, which is noisy, together with the downstream fine-tuning data. We denote this stage as T(D' + D_t).
Finally, we fine-tune the model from the second stage on downstream low-resource training data.We denote this stage as F (D t ), where F represents fine-tuning.
Therefore, the complete training process can be represented as T(D + D_t) → T(D' + D_t) → F(D_t). In the Experiments section, we also explore other training processes, such as two-stage training and direct fine-tuning.
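The three-stage schedule can be sketched as a simple ordered plan (a hypothetical helper, with D, D', and D_t represented as plain lists of examples):

```python
def three_stage_schedule(D, D_prime, D_t):
    """Return the ordered training stages of CLEM:
    T(D + D_t) -> T(D' + D_t) -> F(D_t)."""
    return [
        ("train", D + D_t),        # stage 1: cross-lingual pre-training
        ("train", D_prime + D_t),  # stage 2: translated pseudo data
        ("finetune", D_t),         # stage 3: downstream fine-tuning
    ]

# Toy usage with language tags standing in for datasets.
stages = three_stage_schedule(["en", "zh"], ["en-fr", "zh-vi"], ["fr", "vi"])
```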

Experiments and Results
In this section, we introduce our datasets and baseline system. Additionally, we demonstrate the effectiveness of each component of our methodology, such as adversarial training and the novel training process.

Datasets
We train CLEM on the provided shared-task datasets, containing dialogues in Vietnamese (3,446 turns), 816 dialogues in French (3,510 turns), and a corpus of 17,272 paragraphs in ModelScope, where each dialogue turn is grounded in a paragraph from the corpus. Moreover, we also utilize the Chinese (5,760 turns) and English (26,506 turns) data as additional training data.

Baseline System
The baseline follows the Retrieve, Re-rank, and Generate pipeline, using DPR (Karpukhin et al., 2020) as the retriever and a Transformer encoder (Vaswani et al., 2017) with a linear layer as the re-ranker.

Result and Analysis
We evaluate the generation results with token-level F1, SacreBLEU, and Rouge-L; the final result is the sum of the three. As shown in Table 2, CLEM improves the total result by 28% over the strong baseline, which demonstrates the effectiveness of our method.
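The aggregation of the final score is simply additive; a trivial sketch (assuming the individual metric values are produced elsewhere by standard F1, SacreBLEU, and Rouge-L tooling):

```python
def total_score(f1, sacrebleu, rouge_l):
    """Shared-task final result: the sum of the three metric scores."""
    return f1 + sacrebleu + rouge_l

# Toy usage with illustrative (not actual) metric values.
final = total_score(f1=30.0, sacrebleu=20.0, rouge_l=25.0)
```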

Ablation Study
We study the impact of the different components of CLEM; the results are given in Table 3.
Different pseudo corpora. As described in Section 3, we leverage two translated pseudo corpora, Zh-Vi and En-Fr. We also study the impact of each set under two-stage training. From the 4th and 5th lines of Table 3, performance decreases without Zh-Vi (Chinese to Vietnamese) or En-Fr (English to French), which proves that the translated corpora are useful for the shared task.
Without prompt. We also run experiments without the prompt to explore its impact. As shown in the last line of Table 3, the performance of CLEM decreases sharply.
Without FGM. We also explore the effectiveness of FGM (Miyato et al., 2017) in the retriever and re-ranker. The results are listed in Table 4. We observe significant improvements from retrieval to re-ranking, which proves the effectiveness of the re-ranker.

Conclusion
This paper introduces CLEM, a novel pipeline for document-grounded dialogue systems that uses a "retrieve, re-rank, and generate" approach. To address the low performance caused by limited training data, we extend adversarial training to the document Retriever and Re-ranker components. Additionally, CLEM leverages high-resource languages to improve low-resource languages and introduces a new training process for data-scarce settings.
Experimental results demonstrate that CLEM outperforms the strong, competitive baseline and achieved 4th place on the leaderboard of the Third DialDoc competition. These findings provide a promising approach for generating grounded dialogues in multilingual settings with limited training data and further demonstrate the effectiveness of leveraging high-resource languages for low-resource language enhancement.

Table 1 :
Statistics of the provided datasets. The Chinese and English corpora are provided by the Third DialDoc workshop committee. Zh-Vi and En-Fr denote the number of examples translated from Chinese to Vietnamese and from English to French, respectively.
We generated the pseudo data by translating 5,000 English examples into French and 5,000 Chinese examples into Vietnamese. After filtering out instances of poor quality or excessive length, we ultimately obtained 4,980 En-Fr and 4,908 Zh-Vi pseudo examples. We thus have three training sets: the cross-lingual training data D, the translated pseudo data D', and the downstream fine-tuning data D_t. We show how these data are used in Section 4.4, and their statistics are presented in Table 1.

Table 2 :
Performance of CLEM on the test set.

Table 3 :
Ablation results of the model on the development set. The best results are marked in bold. "Two-stage" means we do not use the original Chinese and English data; "Fine-tune" means we use only the downstream training data.

Table 4 :
Effect of FGM on the development set, where † means we use adversarial training.