AUGVIC: Exploiting BiText Vicinity for Low-Resource NMT

The success of Neural Machine Translation (NMT) largely depends on the availability of large bitext training corpora. Due to the lack of such large corpora in low-resource language pairs, NMT systems often exhibit poor performance. Extra relevant monolingual data often helps, but acquiring it can be quite expensive, especially for low-resource languages. Moreover, domain mismatch between the bitext (train/test) and monolingual data might degrade performance. To alleviate such issues, we propose AUGVIC, a novel data augmentation framework for low-resource NMT which exploits the vicinal samples of the given bitext without explicitly using any extra monolingual data. It can diversify the in-domain bitext data with finer-level control. Through extensive experiments on four low-resource language pairs comprising data from different domains, we show that our method is comparable to traditional back-translation that uses extra in-domain monolingual data. When we combine the synthetic parallel data generated by AUGVIC with that from extra monolingual data, we achieve further improvements. We show that AUGVIC helps to attenuate the discrepancies between relevant and distant-domain monolingual data in traditional back-translation. To understand the contributions of different components of AUGVIC, we perform an in-depth framework analysis.


Introduction
Neural Machine Translation (NMT) has shown impressive performance in high-resource settings, even claiming to achieve parity with professional human translators (Hassan et al., 2018;Popel et al., 2020). Most successful NMT systems have billions of parameters (Lepikhin et al., 2021). They generally work well only when a good amount of parallel training data is available and perform poorly in low-resource conditions (Koehn and Knowles, 2017;Guzmán et al., 2019). However, the majority of languages are low-resourced despite being used by a large portion of the world's population. Hence, improving low-resource MT quality has been of great interest to MT researchers.
There have been several attempts to extend the success of NMT in high-resource settings to low-resource language pairs that have a relatively small amount of available parallel data. Most of these methods mainly focus on leveraging extra monolingual data through back-translation (Sennrich et al., 2016) and self-training, or on translation knowledge transfer through parallel data involving other assisting language pairs (Firat et al., 2016a,b;Johnson et al., 2017;Neubig and Hu, 2018). Large-scale pre-training is another recent trend to utilize large monolingual data for NMT. However, very little work has considered low-resource NMT without using auxiliary data or other pivot languages.
In the presence of a sufficient amount of in-domain monolingual data, back-translation (BT) has proved to be quite successful (Edunov et al., 2018). In this approach, a reverse intermediate model is trained on the original parallel data, which is later used to generate synthetic parallel data by translating sentences from target-side monolingual data into the source language. However, when there is a scarcity of in-domain data, which is indeed a common situation in many low-resource settings, the success of BT may be limited.
Another understudied problem with BT is the issue of domain mismatch. To elaborate, let us consider two scenarios: (i) the training and testing data come from the same or related domains (e.g., News), and (ii) the test domain (News) is different from the training domain (e.g., Subtitles). In the former case, we can foresee two problems. First, if we use out-of-domain monolingual data, which is abundant, it might misguide the model and move it far away from the actual test distribution. Second, even if the monolingual data is from a domain similar to that of the training/testing data, there might be differences in topics, modality, style, etc., which might induce noise.
For the latter scenario, even if the monolingual data comes from a similar domain as the test data (News), the corresponding (reverse) translations will be noisy, as the intermediate model would be trained on a different domain (Subtitles). Consequently, these noisy pseudo-parallel data will induce noise during training and might cause the model to perform worse (Wang et al., 2018). On the other hand, using in-domain (Subtitles) monolingual data in back-translation will not give enough diversity to cover the test domain (News).
In this work, inspired by the Vicinal Risk Minimization principle (Chapelle et al., 2001), we propose AUGVIC, a novel method to augment vicinal samples around the bitext distribution. Instead of using extra monolingual data, AUGVIC aims to leverage the vicinal samples of the original bitext, thereby enlarging the support of the training bitext distribution to improve model generalization. The main advantage is that the resulting distribution remains close to the original distribution and can be controlled at a finer level (Figure 1).
With the goal of training a source-to-target NMT system, AUGVIC augments vicinal samples in the target language. The vicinal samples are generated by predicting the masked tokens of a target bitext sentence using a pretrained large-scale language model. To generate synthetic bitext data from these augmented vicinal samples through a reverse intermediate (target-to-source) model, we propose two different methods: the first one is based on the traditional BT, while the second one leverages the original source sentence as a guide. Finally, we train the source-to-target model by combining the original parallel data with the synthetic bitext.
In order to demonstrate the effectiveness and robustness of AUGVIC, we conduct extensive experiments on four low-resource language pairs comprising data from different domains. Our results show significant improvements over the bitext baselines, with an average gain of 2.76 BLEU on eight different translation tasks without using any extra monolingual data. AUGVIC also complements traditional BT with additive gains when extra monolingual data is used. We also show AUGVIC's efficacy in bridging the gap between in-domain and out-of-domain performance in traditional back-translation with monolingual data. We further carry out an ablation study to understand the contribution of the diversity factor in our proposed framework. We open-source our framework at https://ntunlpsg.github.io/project/augvic/.

Related Work
Two lines of study are relevant to our work.
Low-resource NMT Although the main focus of investigation and improvement in NMT has been in high-resource settings, there has been a recent surge of interest in low-resource MT. However, achieving satisfactory performance in low-resource settings turns out to be challenging for NMT systems (Koehn and Knowles, 2017). Recent research has mainly focused on creating and cleaning parallel (Ramasamy et al., 2014;Islam, 2018) and comparable data (Tiedemann, 2012), utilizing bilingual lexicon induction (Conneau et al., 2017;Artetxe et al., 2018;Joty, 2019, 2020), fine-grained hyperparameter tuning (Sennrich and Zhang, 2019), and using other languages as pivots (Cheng et al., 2017;Kim et al., 2019).
Another avenue of research follows multilingual translation, where translation knowledge from high-resource language pairs is exploited by training a single NMT system on a mix of high-resource and low-resource language pairs (Firat et al., 2016a,b;Kocmi and Bojar, 2018;Gu et al., 2018;Neubig and Hu, 2018;Guzmán et al., 2019). Zoph et al. (2016) proposed a variant where they pretrain an NMT system on a high-resource language pair before fine-tuning it on a target low-resource language pair.
Data Augmentation for NMT To date, one of the most successful data augmentation strategies in NMT is back-translation (BT) (Sennrich et al., 2016;Hoang et al., 2018), which exploits target-side monolingual data. Edunov et al. (2018) investigated BT extensively and scaled the method to millions of target-side monolingual sentences. Caswell et al. (2019) explored the role of noise in noised-BT and proposed to use a tag for back-translated source sentences. Besides BT, self-training is another data augmentation strategy for NMT which leverages source-side monolingual data (He et al., 2020). Large-scale multilingual pre-training followed by bitext fine-tuning is a recent trend to utilize monolingual data for NMT, which is shown to be beneficial (Arivazhagan et al., 2019;Zhu et al., 2020;Lepikhin et al., 2021).
Apart from using extra monolingual data, Xie et al. (2017) show that data noising is an effective regularization method for NMT, while others use noised training. In low-resource settings, Fadaee et al. (2017) augment bitext by replacing a common word with a low-frequency word in the target sentence, and change its corresponding word in the source sentence to improve the translation quality of rare words. Wang et al. (2018) propose an unsupervised data augmentation method for NMT by replacing words in both source and target sentences based on Hamming distance. Gao et al. (2019) propose a method that replaces words with a weighted combination of semantically similar words. Recently, Nguyen et al. (2020) proposed an in-domain augmentation method by diversifying the available bitext data using multiple forward and backward models. In their follow-up work (Nguyen et al., 2021), they extend the idea to unsupervised MT (UMT) using a cross-model distillation method, where one UMT model's synthetic output is used as input for another UMT model.

Summary
Most of the previous work on improving BT involves either training iteratively or combining BT with self-training, using monolingual data blindly without considering the distributional differences between the monolingual and bitext data. In contrast, in AUGVIC we systematically parameterize the generation of new training samples from the original parallel data. Moreover, the combination of our augmented vicinal samples with monolingual data makes the NMT models more robust and attenuates the prevailing distributional gap.

Method
Let s and t denote the source and target languages, respectively, and let D = {(x_i, y_i)}_{i=1}^{N} denote the bitext training corpus containing N sentence pairs, with x_i and y_i coming from languages s and t, respectively. Also, let M_{s→t} be an NMT model that can translate sentences from s to t, and let D^t_mono = {y_j}_{j=1}^{M} denote the monolingual corpus in the target language t containing M sentences.

Traditional Back-Translation
Traditional back-translation (Sennrich et al., 2016) leverages the target-side monolingual corpus. With the aim of training a source-to-target model M_{s→t}, it first trains a reverse intermediate model M_{t→s} using the given bitext D, and uses it to translate the extra target-side monolingual data D^t_mono into the source language. This yields a synthetic bitext corpus D_syn = {(M_{t→s}(y_j), y_j)}_{j=1}^{M}. Then a final model M_{s→t} is trained on D ∪ D_syn, usually by upsampling D to keep the original and synthetic bitext pairs at a certain ratio (generally 1:1).
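The data flow of this procedure can be summarized with the minimal Python sketch below; train_nmt and translate are hypothetical stand-ins for an NMT toolkit (e.g., a Fairseq Transformer), and only the bookkeeping is illustrated.

    def back_translate(bitext, target_mono, train_nmt, translate):
        """bitext: list of (x, y) pairs; target_mono: list of target sentences y_j.
        train_nmt and translate are caller-supplied stand-ins for an NMT toolkit."""
        # 1) Train the reverse intermediate model M_{t->s} on the reversed bitext.
        m_t2s = train_nmt([(y, x) for (x, y) in bitext])

        # 2) Back-translate the target-side monolingual data into the source language.
        d_syn = [(translate(m_t2s, y_j), y_j) for y_j in target_mono]

        # 3) Train the final model on D union D_syn, upsampling D towards a 1:1 ratio.
        k = max(1, round(len(d_syn) / max(1, len(bitext))))
        return train_nmt(bitext * k + d_syn)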

AUGVIC: Exploiting Bitext Vicinity
For low-resource languages, the amount of available parallel data is limited, hindering the training of a good MT system. Moreover, the target language pairs can be quite different (e.g., morphologically, in topic distribution) from the high-resource ones, making the translation task more difficult. Also, acquiring large and relevant monolingual corpora in the target language is difficult in low-resource settings and can be quite expensive. The domain mismatch between the monolingual and bitext data is another issue with traditional back-translation, as mentioned in §1.
With the aim of improving model generalization, the core idea of AUGVIC is to leverage the vicinal samples of the given bitext rather than using extra monolingual data. The addition of bitext vicinity also alleviates the domain mismatch issue since the augmented data distribution does not change much from the original bitext distribution. Figure 1 shows an illustrative example of AUGVIC, which works in three basic steps to train a model:
(i) Generate vicinal samples ỹ_i of the target sentences (y_i) in the bitext data D.
(ii) Produce source-side translations x̃_i of the vicinal samples to generate the synthetic bitext D̃.
(iii) Train the final source-to-target MT model M_{s→t} using D ∪ D̃.
AUGVIC, however, is not mutually exclusive with traditional back-translation, and the two can be used together when relevant monolingual data is available. In the following, we describe how each of these steps is operationalized with NMT models; a minimal sketch of the overall pipeline is given below.
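The sketch below traces the three steps in Python, under the same conventions as the back-translation sketch above; make_vicinal, train_nmt, and translate are hypothetical caller-supplied helpers standing in for the vicinal model, an NMT training routine, and beam-search translation, and are not part of the released code.

    def augvic(bitext, make_vicinal, train_nmt, translate):
        # (i) Vicinal samples of the target side of the bitext.
        vicinal_targets = [y_t for (_, y) in bitext for y_t in make_vicinal(y)]

        # (ii) Reverse intermediate model M_{t->s} trained on D, used to produce x_tilde.
        m_t2s = train_nmt([(y, x) for (x, y) in bitext])
        d_tilde = [(translate(m_t2s, y_t), y_t) for y_t in vicinal_targets]

        # (iii) Final source-to-target model trained on D union D_tilde.
        return train_nmt(bitext + d_tilde)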

Generation of Vicinal Samples
We first generate vicinal samples for each eligible target sentence y_i in the bitext D. Letting V(·|y_i) denote the vicinity distribution around y_i, we create a corpus of vicinal samples as D̃_t = {ỹ_i ∼ V(·|y_i)}_{i=1}^{N}. We generate vicinal samples for sentences having lengths between 3 and 100, and V can be modeled with existing syntactic and semantic alternation methods like language model (LM) augmentation (Kobayashi, 2018;Shi et al., 2020;Bari et al., 2021), paraphrase generation, constrained summarization (Laban et al., 2020), and similar sentence retrieval (Du et al., 2020). Most of these methods are supervised, requiring extra annotations. Instead, in AUGVIC, we adopt an unsupervised LM augmentation, which makes the framework more robust and flexible to use. Specifically, we use a pretrained XLM-R masked LM (Conneau et al., 2020a) parameterized by θ_xlmr as our vicinal model. Thus, the vicinity distribution is defined as V(ỹ_i|y_i, θ_xlmr). Note that we treat the vicinal model as an external entity, which is not trained/fine-tuned. This disjoint characteristic gives our framework the flexibility to replace θ_xlmr even with a better monolingual LM for a specific target language, which in turn makes AUGVIC extendable to utilize stronger LMs that may come in the future.
In a masked LM, one can mask out a token at any position and ask the model to make a prediction at that position. For a meaningful and informed augmentation, we mask out tokens successively (one at a time) up to a required number determined by a diversity ratio ρ ∈ (0, 1). For a sentence of length ℓ, the successive augmentation can generate at most (2^ℓ − 1) × k vicinal samples, where k is the number of output tokens chosen for each masked position. We use k = 1, and pick the token with the highest probability while ensuring that it does not match the original token at the masked position. The diversity ratio ρ controls how diverse the vicinal samples can be from the original sentence, and it is selected in one of the following two ways (an illustrative sketch of both variants follows the list below):
• Fixed diversity ratio Here we use a fixed value for ρ and select t = ⌈ℓ × ρ⌉ tokens to mask out. We then generate new vicinal samples by predicting new tokens at those masked positions.
• Dynamic diversity ratio Instead of using a fixed value, in this approach we set the diversity ratio dynamically, taking the sentence length into consideration. This allows finer-level control over diversification: the longer the sentence, the smaller its diversification ratio should be. The intuition is that for long sentences, a large value of ρ will produce vicinal samples that are far away from the original sample. Specifically, we use a piece-wise function (Eq. 2) to determine the number of tokens to mask out dynamically, where t_min and t_max are hyperparameters representing the minimum and maximum number of tokens to be replaced by the masked LM, and the other hyperparameters a, b, and h play the same role as the diversity ratio ρ.
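To make the two variants concrete, the sketch below (assuming the HuggingFace transformers and PyTorch packages, with xlm-roberta-base standing in for the XLM-R model) performs successive masked-token replacement for a given number of masks, together with one plausible instantiation of the dynamic schedule; since Eq. 2 is not reproduced here, dynamic_num_masks only illustrates its stated behaviour (growth governed by ℓ/h, a, and b, clipped to [t_min, t_max]) and is not the exact function used in the paper.

    import math
    import random
    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
    mlm = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base").eval()

    def vicinal_sample(sentence: str, num_masks: int) -> str:
        """Successively mask and re-predict num_masks (sub)tokens, one at a time."""
        ids = tok(sentence, return_tensors="pt")["input_ids"][0]
        positions = list(range(1, len(ids) - 1))          # skip <s> and </s>
        random.shuffle(positions)                         # one random prediction order
        for pos in positions[:num_masks]:
            original = ids[pos].item()
            ids[pos] = tok.mask_token_id
            with torch.no_grad():
                logits = mlm(ids.unsqueeze(0)).logits[0, pos]
            logits[original] = float("-inf")              # forbid the original token
            ids[pos] = int(logits.argmax())
        return tok.decode(ids[1:-1].tolist())

    def fixed_num_masks(length: int, rho: float = 0.3) -> int:
        """Fixed diversity ratio: t = ceil(l * rho)."""
        return max(1, math.ceil(length * rho))

    def dynamic_num_masks(length: int, a: float = 0.5, b: float = 2.5,
                          h: int = 10, t_min: int = 1, t_max: int = 20) -> int:
        """Illustrative dynamic schedule: short sentences get a higher ratio,
        long sentences grow per l/h pseudo-segment, clipped to [t_min, t_max]."""
        if length <= h:
            t = math.ceil(a * length)
        else:
            t = math.ceil(a * h + b * (length / h))
        return min(t_max, max(t_min, t))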
Since we predict the tokens for replacement one at a time, we can make the predictions in any permutation order of the t masked positions. So, the maximum number of possible augmentations for a sentence of length ℓ is γ = C(ℓ, t) × t!. We perform stochastic sampling over these γ possibilities to select the N vicinal samples. We analyze the effect of the diversity ratio ρ in AUGVIC in §5.5.

Generation of Synthetic Bitext Data
Our objective is to train a source-to-target MT model M_{s→t}. So far, we have the bitext D = {(x_i, y_i)}_{i=1}^{N} and the target-side monolingual data D̃_t = {ỹ_j}_{j=1}^{N}, which are vicinal to the original targets in D. We need a reverse intermediate target-to-source MT model M_{t→s} to translate ỹ_j into x̃_j, which will give us the synthetic bitext data D̃. For this, we experiment with two different models.
(a) Pure Back-Translation (PBT) This is similar to back-translation (§3.1), where we first train the reverse MT model M_{t→s} using the given bitext D. We then use M_{t→s} to translate the target-side vicinal samples ỹ_j ∼ D̃_t into x̃_j. This gives a synthetic bitext D̃ = {(x̃_j, ỹ_j)}_{j=1}^{N}. We use the Transformer architecture (Vaswani et al., 2017) as our reverse intermediate NMT model M_{t→s}.
(b) Guided Back-Translation (GBT) In the illustrative example (Figure 1), we can identify three kinds of pairs: (i) the bitext pair (x_i, y_i), (ii) the vicinal pair (y_i, ỹ_i), and (iii) the synthetic pair (x̃_i, ỹ_i). Here, y_i is the original translation of the source sentence x_i, and ỹ_i is the vicinal sample, which can be seen as a perturbation of y_i. Hence, we can assume that x̃_i will also be similar to (a perturbation of) x_i. Our goal is to leverage this extra relational knowledge to improve the translation quality of x̃_i when generating the synthetic bitext D̃. Specifically, we use the original source x_i as a guide for generating the synthetic translation x̃_i of the target-side vicinal sample ỹ_i.
For this, we propose a model based on the Transformer architecture that has two encoders, one for the source sentence (E) and another for the guide sentence (E′), and a decoder (D) (Figure 2). We use the same architecture as the standard Transformer, with the exception that we now have two identical encoders (E and E′). Both encoders have a stack of L layers, while the decoder has (L + 1) layers.
Training & Inference: We train this model with a dataset of triplets containing (y,x, x), where (x, y) comes from the original bitext andx is a vicinal sample of x to guide the decoder in generating x.
Each of the first L layers of the decoder performs cross-attention on E(y), yielding decoder states D^{(L)}(x_{<t}|y) at time step t, while the final decoder layer attends on E′(x̃), yielding a second set of decoder states D^{(L+1)}(x_{<t}|y, x̃). The two sets of decoder states are then interpolated by taking a convex combination, λ · D^{(L)}(x_{<t}|y) + (1 − λ) · D^{(L+1)}(x_{<t}|y, x̃), before passing the result to a linear layer followed by a softmax for token prediction (Eq. 4), where λ is a hyperparameter that controls the relative contributions of the two encoders, E(y) and E′(x̃), in generating x with the decoder D.
To generate the synthetic bitext D̃, we need to translate ỹ guided by x. So during inference, we feed ỹ to E and x to E′ and autoregressively generate x̃ with beam search decoding.
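The interpolation step can be sketched as follows in PyTorch; the tensor shapes, the log-softmax output, and the placement of λ on the D^{(L)} states are illustrative assumptions rather than details of the released implementation.

    import torch
    import torch.nn as nn

    class InterpolatedOutput(nn.Module):
        """Convex combination of the two sets of decoder states before the
        output projection, as described for guided BT above (illustrative)."""

        def __init__(self, d_model: int, vocab_size: int, lam: float = 0.7):
            super().__init__()
            self.lam = lam                               # lambda in the text
            self.proj = nn.Linear(d_model, vocab_size)   # shared output projection

        def forward(self, dec_states_L: torch.Tensor, dec_states_L1: torch.Tensor):
            # dec_states_L  : D^(L)(x_<t | y),        shape [batch, tgt_len, d_model]
            # dec_states_L1 : D^(L+1)(x_<t | y, x~),  shape [batch, tgt_len, d_model]
            mixed = self.lam * dec_states_L + (1.0 - self.lam) * dec_states_L1
            return torch.log_softmax(self.proj(mixed), dim=-1)

    # Example usage with random tensors (batch=2, tgt_len=5, d_model=8, vocab=100):
    layer = InterpolatedOutput(d_model=8, vocab_size=100, lam=0.7)
    log_probs = layer(torch.randn(2, 5, 8), torch.randn(2, 5, 8))
    print(log_probs.shape)   # torch.Size([2, 5, 100])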

Training of the Final Model
We combine the original bitext D and the synthetic bitext D̃ generated in the previous step to train our final source-to-target model M_{s→t}. We use the standard Transformer as our final model.

Datasets and Evaluation Metrics
We conduct experiments on four low-resource language pairs: English (En) to/from Bangla (Bn), Tamil (Ta), Nepali (Ne), and Sinhala (Si). Even though the En-Bn dataset is relatively small (∼72K pairs), the quality of the bitext is rich, and it covers a diverse set of domains including literature, journalistic texts, instructive texts, administrative texts, and texts on external communication. Here the distributions of the train and test splits are about the same. For En-Ta, the train and test domains are similar, mostly coming from news (∼66.43%). For En-Ne and En-Si, we use the datasets from Guzmán et al. (2019), where the train and test domains are different. Although these two datasets are comparatively larger (∼600K pairs each), the quality of the bitext is poor, requiring further cleaning and deduplication. Table 2 presents the dataset statistics after deduplication, where the last column specifies the amount of data augmented by our method AUGVIC (§3.2.1). For a fair comparison with traditional back-translation, we experiment with the same amount of target-side monolingual data from three domains: News, Wiki, and Gnome. We collected and cleaned the News, Wiki, and Gnome datasets from News-crawl, Wiki-dumps, and the Gnome localization guide, respectively. For some languages, the amount of domain-specific monolingual data is limited, in which case we added additional monolingual data of that language from Common Crawl.

Baselines
We compare AUGVIC with the following baselines: (i) Bitext baseline is the model trained only on the bitext provided with the dataset.
(ii) Upsample baseline Here we upsample the bitext to the same amount as AUGVIC's data.
(iii) Diversification baseline Nguyen et al. (2020) diversify the original parallel data by using the predictions of multiple forward and backward NMT models. They then merge the augmented data with the original bitext, on which the final NMT model is trained. Their method is directly comparable to AUGVIC, as both methods diversify the original bitext, but in different ways.

Model Settings
We use the Transformer (Vaswani et al., 2017) implementation in Fairseq. We follow the basic architectural settings from Guzmán et al. (2019), which establishes some standards for low-resource MT. For the low-resource "Bitext baseline", they use a smaller (5-layer) Transformer architecture as the dataset is small, while for larger datasets (e.g., with additional synthetic data) they use a bigger (6-layer) model. To keep the architecture the same in the respective rows (Table 3), we use a 6-layer model for the "Upsample baseline" and a 5-layer model for the "Bitext baseline". More specifically, for datasets with less than a million bitext pairs, we use an architecture with 5 encoder and 5 decoder layers, where the number of attention heads, embedding dimension, and inner-layer dimension are 8, 512, and 2048, respectively. Otherwise, we use a larger Transformer architecture with 6 encoder and 6 decoder layers, with 16 attention heads, an embedding dimension of 1024, and an inner-layer dimension of 4096.
After deduplication, we tokenize non-English data using the Indic NLP Library. We use the SentencePiece library to learn a joint Byte-Pair Encoding (BPE) of 5000 symbols for each language pair over the raw English and tokenized non-English bitext training data.
We tuned the hyperparameters a, b, h, t_min, t_max in Eq. 2 and λ in Eq. 4 through small-scale experiments on the validation sets. We found that a = 0.5, b = 2.5, h = 10, t_min = 1, and t_max = 20 work better. We tuned λ within the range of 0.5 to 0.9. In general, we observe that for shorter sentences (length <= 20), 50-60% successive token replacement works better, while for longer sentences (length > 20), 20-30% token replacement performs better.
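For concreteness, the following sketch shows how such a joint 5000-symbol BPE model could be learned with the SentencePiece Python API; the file names are placeholders for the concatenated raw-English and tokenized non-English training text, not the exact commands used in our experiments.

    import sentencepiece as spm

    # Learn a 5000-symbol joint BPE vocabulary over the (hypothetical) training file.
    spm.SentencePieceTrainer.train(
        input="train.en-xx.txt",        # placeholder: concatenated bitext training text
        model_prefix="bpe.en-xx",
        vocab_size=5000,
        model_type="bpe",
    )

    # Apply the learned model to a sample sentence.
    sp = spm.SentencePieceProcessor(model_file="bpe.en-xx.model")
    print(sp.encode("This is a test sentence.", out_type=str))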
Following Guzmán et al. (2019), we train all models for up to a maximum of 100 epochs with early stopping based on the validation loss. We use beam search decoding for inference. All reported results for AUGVIC use the dynamic diversity ratio for generating vicinal samples unless otherwise specified.

Results and Analysis
In this section, we present our results and the analysis of our proposed methods. Table 3 presents the BLEU scores on the eight translation tasks. First, we compare our model AUGVIC with the model trained on the original parallel data (Bitext). AUGVIC consistently improves performance on all the tested language pairs, gaining about 2.76 BLEU on average. Specifically, AUGVIC achieves absolute improvements of 4.28, 5.78, 1.35, 2.39, 1.88, 2.31, 1.70, and 1.82 over the Bitext baseline for En-Bn, Bn-En, En-Ta, Ta-En, En-Ne, Ne-En, En-Si, and Si-En, respectively.

Comparison with Bitext & Diversification
For a fair comparison, in another experiment, we upsample the bitext data to match the amount of AUGVIC's data. From the Upsample results (with a 6-layer architecture) reported in Table 3, we see that even though it increases the BLEU scores for En to/from {Bn, Ta}, it has a negative impact on En to/from {Ne, Si}, where it degrades performance. Overall, AUGVIC achieves an average improvement of 1.75 BLEU over the Upsample baseline.
The comparison with the diversification strategy proposed by Nguyen et al. (2020) reveals that AUGVIC outperforms their method by 0.84 BLEU on average. To be specific, our method obtains 0.49, 0.85, 0.19, 0.14, 0.77, 1.75, 1.46, and 1.07 absolute BLEU improvements over their approach for En-Bn, Bn-En, En-Ta, Ta-En, En-Ne, Ne-En, En-Si, and Si-En, respectively.
The data diversification method of Nguyen et al. (2020) relies heavily on the performance of the base (Bitext) models. From Table 3, we see that the performance of the base models is poor for En to/from {Ne, Si}, which impacts their augmented data generation process (diversification). However, the better performance of AUGVIC on those languages indicates that the vicinal samples generated by our method are more diverse, of better quality, and less affected by the noise in the base models.

Vicinal Samples with Extra Relevant Monolingual Data
We further explore the performance of AUGVIC by experimenting with the traditional back-translation method (§3.1) using the same amount of monolingual data. To assess the variability, we experiment with extra monolingual data from two relevant but different sources: News-crawl (BT-Mono (News)) and Wikipedia (BT-Mono (Wiki)). From the results in Table 3, we see that standard back-translation improves the scores in both cases, confirming that extra relevant monolingual data helps low-resource MT significantly.
To understand how the vicinal samples of AUGVIC differ from the external related monolingual data, we perform another set of experiments where we combine AUGVIC's augmented data with the extra monolingual data and train along with the bitext data. From Table 3, we see that the combination of datasets improves the BLEU scores by 1.02 and 0.73 on average for the two relevant data sources (News and Wiki). From this, we can conclude that the vicinal samples of AUGVIC make the NMT models more robust in the presence of relevant monolingual data and can be used together with it when available.

Pure vs. Guided: Which One is Better?
For all the AUGVIC results presented in Table 3, we use the pure back-translation (PBT) method (§3.2.2(a)) as the reverse intermediate model. In Table 4, we compare the performance of guided BT (§3.2.2(b)) against pure BT as the reverse intermediate model. From the results, we observe that guided BT achieves better results on En↔{Bn, Ta}, while pure BT does better on the En↔{Ne, Si} translation tasks.
We investigated why guided BT performed poorly on the En↔{Ne, Si} tasks and found that, compared to the En-Bn and En-Ta bitexts, the original En-Ne and En-Si bitexts are very noisy (e.g., bad sentence segmentation, code-mixed data), which propagates further noise when using the target translation as a guide for translating the vicinal samples. The diminishing results when upsampling these two languages (Table 3) support this claim. From these results, we can say that the better the quality of the original bitext, the better the synthetic bitext will be for guided BT.

AUGVIC with Distant-Domain Monolingual Data
From Table 5, we see that traditional back-translation (+ BT) improves the BLEU scores over the Bitext by 4.14 and 2.85 on average for relevant- and distant-domain monolingual data, respectively, yielding higher gains for the relevant domain, as expected. The addition of vicinal data by AUGVIC (+ AUGVIC + BT) further improves the scores in both cases; interestingly, the relative improvements are higher in the distant-domain case. Specifically, the average BLEU score improvements over Bitext for relevant- and distant-domain data with AUGVIC + BT are 4.97 and 4.41, respectively. Compared with BT only, the BLEU score difference between relevant and distant domains is thus reduced from 1.29 to 0.56. This indicates that AUGVIC helps to bridge the domain gap between relevant and distant-domain distributions in traditional BT with monolingual data.
In principle, the reverse intermediate target-to-source MT model should generate better synthetic pairs for vicinal samples than for arbitrary monolingual data, which may come from a distant distribution compared to the bitext. Judging by the amount of diverse data used for training the language model, we can safely assume that it is a diverse knowledge source (Conneau et al., 2020b) compared to the training bitext samples. Data that the reverse intermediate target-to-source MT system can translate well can thus be extrapolated from this knowledge base as a vicinal distribution with the controlled diversity ratio function (Eq. 2). Moreover, to achieve more diversity, AUGVIC is also compatible with using multiple different language models.

Effect of Diversity Ratio in AUGVIC
For monolingual data, it can be challenging to identify the domain discrepancy with respect to the training/testing bitext data, and the traditional BT method has no parameter to control this distributional mismatch. In AUGVIC, however, we can control the distributional drift of the generated vicinal samples from the original training distribution by varying the diversity ratio ρ.
Theoretically, it is possible to sample from the same distribution using either the dynamic or the fixed diversity ratio. However, the dynamic diversity ratio is more flexible for hyperparameter tuning and for preventing potential outliers. The term ℓ/h in Eq. 2 represents a pseudo-segmentation of a long sentence of length ℓ (into segments of size h), and b plays the same role as ρ. Apart from these, t_min and t_max prevent irregular samples: (i) t_min ensures that there is at least some change in the augmented sample, and (ii) t_max makes sure that the samples generated by the LM do not diverge too much from the vicinity.
To understand the effect of the diversity ratio in AUGVIC, we perform another set of experiments. We choose En to/from {Bn, Ne} for these experiments, where we select at most two vicinal samples for each target sentence in the original bitext. We investigate the effect of both the dynamic and the fixed diversity ratio in AUGVIC's vicinal sample generation (§3.2.1). For the fixed diversity ratio, we use ρ values of 0.1, 0.3, 0.5, and 0.8, while for the dynamic diversity ratio we use a = 0.5, b = 2.5, and h = 10 to control the diversity.
We present these experimental results in Table 6, from which we see that the dynamic diversity ratio performs better on three out of the four tasks. For the fixed diversity ratio, we see variation in the results for different values of ρ. In all four tasks, ρ = 0.8 gives the lowest scores. On average, we get better results with ρ = {0.3, 0.5}. These experiments suggest that higher diversity values may induce noise, while lower diversity values may not diversify the data enough to benefit the final NMT model.

Conclusion
We have presented AUGVIC, an in-domain data augmentation framework that exploits the bitext vicinity for low-resource NMT. Our method generates vicinal samples by diversifying sentences of the target language in the bitext in a novel way. It is simple yet effective and can be quite useful when extra in-domain monolingual data is limited. Extensive experiments on four low-resource language pairs comprising data from different domains show the efficacy of AUGVIC. Our method is not only comparable to traditional back-translation with in-domain monolingual data, but also makes NMT models more robust in the presence of relevant monolingual data. Moreover, it bridges the distributional gap for out-of-domain monolingual data when the two are used together.