Reference Free Domain Adaptation for Translation of Noisy Questions with Question Specific Rewards

Community Question-Answering (CQA) portals serve as a valuable tool for helping users within an organization. However, making them accessible to non-English-speaking users continues to be a challenge. Translating questions can broaden the community's reach, benefiting individuals with similar inquiries in various languages. Translating questions with Neural Machine Translation (NMT) poses additional challenges, especially in noisy environments, where the grammatical correctness of the questions is not monitored. These questions may be phrased as statements by non-native speakers, with incorrect subject-verb order and sometimes even missing question marks. Creating a synthetic parallel corpus from such data is also difficult due to its noisy nature. To address this issue, we propose a training methodology that fine-tunes the NMT system using only source-side data. Our approach balances adequacy and fluency by utilizing a loss function that combines BERTScore and the Masked Language Model (MLM) score. Our method surpasses the conventional Maximum Likelihood Estimation (MLE) based fine-tuning approach, which relies on synthetic target data, by a 1.9 BLEU score. Our model remains robust when we add noise to the baseline, still achieving a 1.1 BLEU improvement along with large gains on the TER and BLEURT metrics. Our proposed methodology is model-agnostic and is only necessary during the training phase. We make the code and datasets publicly available at \url{https://www.iitp.ac.in/~ai-nlp-ml/resources.html#DomainAdapt} to facilitate further research.

1 Introduction

E-commerce decision-making heavily depends on community question-answering. When product descriptions and reviews fail to persuade users, they often turn to question-answering forums to address their concerns. However, English is used extensively on the majority of community question-answering portals. This situation renders it impossible for non-English speakers to ask questions and make informed purchasing decisions. Additionally, the potential loss of sales negatively affects businesses. Machine Translation (MT) is a valuable tool that enables users to communicate with individuals speaking different languages.

What the user wants to say        What the user wrote
Does it work in Samsung A50S?     It works in samsung a50s
Will it fit in Xylo E4?           In Xylo E4 will fit

Table 1: An illustration of the mismatch between the input received by the MT system and its intended meaning. Considering that the sentence in the second column of the first row is intended to be a question, it is not grammatically correct.
Translating noisy questions differs significantly from translating general-domain data, statements, or answers. Firstly, most currently available general-domain data consists of statements rather than questions, making it less effective for translating questions. In the largest publicly accessible dataset (Ramesh et al., 2022), only 3.17% of the total lines contain a question mark. Secondly, questions exhibit a higher frequency of grammatical errors and are frequently presented as statements. For example, consider a sentence in the question field of a Community QA site: It works in samsung a50s. In this case, the user intends to inquire about the product's compatibility with the Samsung A50S, so the question should have been: Does it work in Samsung A50S? The initial query is grammatically incorrect and appears more like a statement asserting that the user knows the product works with the Samsung A50S. Moreover, the absence of a question mark ("?") makes it difficult for both humans and automated systems to recognize it as a question unless they are aware it was posted in a community QA site's question field. In regions where English is not the first language, such grammatical errors in user queries are common; this is particularly true of the Indian subcontinent, which has a very diverse linguistic population.
We aim to develop an NMT system that fluently translates English questions into a target language. In simpler terms, the input should be translated as though it were grammatically correct. This is challenging because the model must address grammatical errors in sentences that may seem grammatically correct at the sentence level. Furthermore, the manual creation of training sets for this data type is time-consuming and costly, as it involves annotating the intended input rather than just translating the text. Therefore, we avoid using parallel data 1 for fine-tuning and instead utilize one pre-trained model as our baseline, fine-tuning it exclusively with source-side data. This paper makes contributions in the following ways:

• Our models deal with noisy data during training, which is very challenging in the unsupervised domain-adaptation setting.
• Our method can translate sentences that appear grammatically correct on the surface but are grammatically incorrect when considering the contextual information that they are questions or queries.
• We propose a novel domain-adaptation method that balances adequacy and fluency without requiring references.
• Existing unsupervised methods rely on targetside monolingual data, while our methods work on source-side monolingual data.

Related Work
Neural Machine Translation (NMT) has made significant progress in the past decade and has even reached human-level performance in certain domains and language pairs. However, research in the field of question-answering (QA) translation is still in its early stages, with only a few attempts having been made. Vikram and Dwivedi (2018) and Dwivedi and Vikram (2018) focused on translating academic question papers, primarily emphasizing word-sense disambiguation. It is important to note that the questions in this context were well-structured and grammatically correct. Gain et al. (2022) tackled the translation of user-generated questions and enhanced the translation quality by incorporating answers alongside questions during the training process. This approach allowed the model to leverage contextual information from the answers. Furthermore, they employed fine-tuning techniques by training the model solely on questions or by using explicit question/answer tags to distinguish between them. Although these methods led to improvements in translation quality, fine-tuning on questions alone produced similar results. However, it is important to highlight that the synthetic target-side dataset had limitations in addressing question-specific issues, and using synthetic data posed risks related to hallucinations and grammatical errors. The creation of domain-specific noisy question annotations can be costly, rendering question translation infeasible using Maximum Likelihood Estimation (MLE) training. Khayrallah and Koehn (2018) showed that training with noisy data can severely impact the results. Gain et al. (2023) explored the usage of visual context for the translation of noisy texts.

Footnote 1: Technically, we supply the synthetic reference to the model. However, the references were used to sort the data according to their lengths to keep the training data order consistent among different models. The synthetic reference was not used to calculate the loss of the model.
Alternative training methods, such as Minimum Risk Training (MRT) (Shen et al., 2016), are employed to optimize model parameters with respect to arbitrary evaluation metrics, such as BLEU, to achieve superior translation outputs. Edunov et al. (2018) observed that combinations of token-level and sequence-level losses outperformed the use of either loss type individually. It is worth noting that these methods also necessitate access to reference data, which makes them less suitable for translating noisy questions. As a result, the search for unsupervised methods becomes essential. Wieting et al. (2019) introduced a loss function based on semantic similarity, which measures the similarity between hypotheses and references. Dou et al. (2019) harnessed target-side monolingual data to obtain domain-aware feature embeddings through language modeling tasks. Zheng et al. (2021) proposed the creation of a datastore for k-nearest-neighbor retrieval to facilitate domain adaptation in Neural Machine Translation (NMT) using target-side monolingual data. This method yielded comparable results to traditional back-translation techniques. It is worth noting that all the unsupervised domain adaptation methods (Yang et al., 2018; Zheng et al., 2021) discussed in this context necessitate the availability of target-side monolingual data. However, we only have access to source-side monolingual data. Therefore, we propose a novel method to fine-tune an NMT model using only source-side data, addressing the limitation of requiring target-side monolingual data in unsupervised domain adaptation, specifically for noisy text.

Background
The Neural Machine Translation (NMT) task can be divided into two major components. Fluency: ensuring grammatical correctness of the generated output in the target language. Adequacy: preserving the meaning of the source text in the generated output. NMT systems often produce translations that are fluent but may lack adequacy. Voita et al. (2021) suggested that this is partly because the models tend to prioritize the partially translated output over the source sentence during the decoding stage. Achieving a balance between fluency and adequacy remains a challenging task. Most NMT systems use the Maximum Likelihood Estimation (MLE) (Johansen and Juselius, 1990) objective during training:

L_MLE(θ) = − ∑_{s=1}^{S} ∑_{n=1}^{N} log P(y_n^{(s)} | y_{<n}^{(s)}, x^{(s)}; θ)    (1)

In Equation 1, S represents the number of training samples in a batch, N is the number of tokens on the target side of a training sample, y_n^{(s)} is the ground-truth target-side token at step n, x^{(s)} is the source sentence, y_{<n}^{(s)} represents the target-side tokens from previous steps, and θ denotes the model parameters. Note that during training, the teacher forcing (Williams and Zipser, 1989) method is used for faster convergence and stable training. In teacher forcing, ground-truth tokens y_{<n}^{(s)} are used instead of the partially translated output ŷ_{<n}^{(s)}. Major disadvantages of teacher forcing include:

• The trained model is exposed only to the training distribution but not to its own output. However, the reference is not supplied to the model during testing. This makes the model rely completely on its (possibly wrong) partially translated output, creating a discrepancy between training and testing (Ranzato et al., 2015).
• Typically, evaluation metrics in NMT are applied at the sentence or document level.While the Maximum Likelihood Estimation (MLE) objective is effective in achieving high token accuracy, it may not yield optimal results for other metrics such as BLEU (Papineni et al., 2002), TER (Snover et al., 2006), COMET (Rei et al., 2020), etc.
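As a concrete illustration, the MLE objective with teacher forcing (Equation 1) can be sketched in a few lines; `model_probs` below is a toy stand-in for the NMT decoder, not the system used in this paper.

```python
import math

def mle_loss(batch, model_probs):
    """Token-level MLE loss with teacher forcing (Equation 1).

    batch: list of (source, target_tokens) pairs.
    model_probs(source, prefix, token) -> P(token | prefix, source); at
    step n the decoder is conditioned on the *ground-truth* prefix y_<n
    rather than on its own partially generated output.
    """
    loss = 0.0
    for source, target in batch:
        for n, token in enumerate(target):
            prefix = target[:n]  # teacher forcing: gold prefix, not model output
            loss -= math.log(model_probs(source, prefix, token))
    return loss

# Toy "model" that assigns probability 0.5 to every token.
uniform = lambda src, prefix, tok: 0.5
batch = [("how are you", ["kaise", "ho"])]
print(mle_loss(batch, uniform))  # 2 tokens -> -2*log(0.5) ≈ 1.386
```

Minimizing this loss maximizes per-token accuracy, which, as noted above, does not necessarily optimize sequence-level metrics such as BLEU or TER.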
The challenges in NMT can be addressed through various approaches. For instance, a) gradually exposing the model to partially translated output as training progresses has been proposed as a solution (Zhang et al., 2019b). Alternatively, b) one can pre-train the model using the MLE objective before fine-tuning it with a desired evaluation metric, such as BLEU. These techniques have demonstrated their effectiveness in improving results up to a certain point. Nonetheless, it is worth noting that the MLE objective often performs sufficiently well, especially when applied to clean, extensive datasets alongside suitable regularization techniques. In real-world use cases, many organic datasets, including conversations, question-answers, and reviews, are primarily monolingual. While back-translation is recognized as an effective technique for leveraging monolingual data, it cannot be utilized when only source-side monolingual data is available. Given that numerous datasets and websites are predominantly in English, and the construction of machine translation systems often involves translating from English to other languages, generating high-quality synthetic parallel datasets through forward-translation can be challenging. In back-translation, the source-side text is synthetic, while the target side is considered the gold standard. Back-translation aids in robust training, as it introduces errors on the source side due to the synthetic nature of the text, while the target side remains correct. This is the opposite of forward-translation, making the latter somewhat less effective but valuable in situations where better alternatives are lacking in standard machine translation settings.
However, employing synthetic data through forward-translation can lead to adverse effects when dealing with noisy text translation.Because the source-side is inherently noisy, forwardtranslated synthetic data will inevitably contain a substantial amount of errors.This, in turn, results in the propagation of errors during model training.
Furthermore, using alternative evaluation metrics as loss functions is often not very helpful, as they rely on gold-standard references, such as the BLEU score, which may not be available.Therefore, we propose a novel loss function that relies solely on source-side sentences, a target-side language model, and a source-side grammatical error correction model.

Methodology
In this section, we first delve into the MLM score and BERTScore and explain why we have chosen to incorporate them into our loss function. Subsequently, we detail our proposed training procedure for the translation of noisy questions.

Masked Language Model Score
The Language Model (LM) score of a sentence can be described as in Equation 2, where y represents the sentence with |y| tokens and y_{<n} denotes the tokens at earlier positions in the sentence:

log P_LM(y) = ∑_{n=1}^{|y|} log P(y_n | y_{<n})    (2)

Salazar et al. (2020) introduced a method known as the Masked Language Model (MLM) score. While the log-likelihood of a token in a Language Model (LM) is conditioned solely on previous tokens, in an MLM it is conditioned on both previous and following tokens, as described in Equation 3, where y_{∼n} denotes all the tokens in the sentence except for the one at the n-th position:

MLM(y) = ∑_{n=1}^{|y|} log P_mlm(y_n | y_{∼n})    (3)

Notably, in contrast to the LM score, the MLM score does not suffer from a left-to-right bias.
LM and MLM scores are commonly employed in MT re-ranking (Olteanu et al., 2006). Typically, a set of K candidate translations is generated using an NMT model. These candidates are then forwarded to either an LM or an MLM for scoring, and the candidate with the highest LM or MLM score is chosen as the final output for a given sentence. This re-ranking technique has proven effective in improving results compared to a basic model without re-ranking. Language models tend to favor fluent sentences, which is advantageous in general-domain translation where most candidates are adequate. However, in noisy scenarios, employing MLM scores for re-ranking is less effective. Firstly, candidate translations are more likely to be inadequate compared to non-noisy scenarios, so choosing the candidate with the best MLM score might lead to inadequacy. Furthermore, re-ranking functions as a pipeline between the NMT and MLM models, introducing additional processing time by passing candidate translations to the MLM. This results in increased testing time. Hence, we employ the MLM score as the primary loss function during training to encourage the model to generate fluent utterances. For scoring the candidates, we utilize the bert-base-multilingual-uncased model. Nonetheless, it is important to note that reinforcement learning (RL) models can exhibit reward-hacking behaviors when the rewards are not balanced. Since our rewards would otherwise be based solely on fluency and not adequacy, the model may strive to produce highly fluent but contextually irrelevant sentences to achieve a better loss value. To address this issue, we introduce BERTScore (Zhang et al., 2019a) into our loss function, aiming to strike a balance between adequacy and fluency. Further details about this approach are discussed in the following section.
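A minimal sketch of the MLM (pseudo-log-likelihood) score of Equation 3 and its use for re-ranking; the toy probability table below stands in for bert-base-multilingual-uncased, which we use in practice.

```python
import math

def mlm_score(tokens, p_masked):
    # Equation 3: mask each position in turn and sum the log-probability
    # the masked LM assigns to the true token given all other tokens.
    score = 0.0
    for n, token in enumerate(tokens):
        context = tokens[:n] + ["[MASK]"] + tokens[n + 1:]
        score += math.log(p_masked(context, n, token))
    return score

# Toy masked LM: frequent, well-formed tokens get higher probability.
vocab_p = {"does": 0.6, "it": 0.6, "work": 0.5, "?": 0.7, "wrk": 0.05}
p_masked = lambda context, n, token: vocab_p.get(token, 0.1)

candidates = [["does", "it", "work", "?"], ["does", "it", "wrk", "?"]]
best = max(candidates, key=lambda c: mlm_score(c, p_masked))
print(best)  # the fluent candidate scores higher
```

As discussed above, we use this score during training rather than as a test-time re-ranker, so the extra MLM forward passes add no inference cost.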

Pair-wise Cosine Similarity between Source and Candidate
As the target-side data is unavailable, widely used machine translation metrics like BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), COMET, etc., which rely on human-annotated reference sentences, cannot be used. Therefore, we search for metrics that can measure the similarity between the (noisy) source side and the generated candidate on the target side. It is essential for the metric to be multilingual so that it can handle both the source and the target side. Cosine similarity between multilingual embeddings of the source and candidate appears to be a suitable choice for this purpose, and contextual word embeddings (McCann et al., 2017; Peters et al., 2018; Devlin et al., 2018) are preferable to static ones. Recall is calculated as specified in Equation 4: it reflects, for each word in the source sentence, the highest similarity score against the words of the candidate translation.

R_BERT = (1 / |x|) ∑_{x_i ∈ x} max_{ŷ_j ∈ ŷ} x_i⊤ ŷ_j    (4)

Similarly, precision is calculated as specified in Equation 5. In this context, precision reflects the sum of the highest similarity score of the most similar word in the source sentence for each word in the candidate translation.

P_BERT = (1 / |ŷ|) ∑_{ŷ_j ∈ ŷ} max_{x_i ∈ x} x_i⊤ ŷ_j    (5)
Finally, the F1 score is calculated from R_BERT and P_BERT:

F_BERT = 2 · (P_BERT · R_BERT) / (P_BERT + R_BERT)    (6)
It is crucial to consider F1 instead of solely focusing on precision or recall.Relying only on precision might lead the models to generate very short sentences that are similar to some of the source words, making them fluent but not containing all the information from the source.Similarly, prioritizing only recall could encourage the model to generate longer sequences with most of the source words but also including words not present in the source.
For scoring our candidates, we employ the mbart-large-50-one-to-many-mmt model (Tang et al., 2020). In the subsequent sections, we will refer to BERTScore as F_BERT.
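The greedy matching behind the recall, precision, and F1 computations (Equations 4 and 5) can be sketched as follows; the two-dimensional toy vectors stand in for the contextual embeddings produced by the real model.

```python
def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda w: sum(a * a for a in w) ** 0.5
    return dot / (norm(u) * norm(v))

def bert_score(src_emb, cand_emb):
    # Recall (Eq. 4): best match in the candidate for each source token.
    r = sum(max(cosine(x, y) for y in cand_emb) for x in src_emb) / len(src_emb)
    # Precision (Eq. 5): best match in the source for each candidate token.
    p = sum(max(cosine(x, y) for x in src_emb) for y in cand_emb) / len(cand_emb)
    # F1 (Eq. 6) balances the two.
    f1 = 2 * p * r / (p + r)
    return p, r, f1

src = [[1.0, 0.0], [0.0, 1.0]]
short_cand = [[1.0, 0.0]]  # covers only half of the source
p, r, f1 = bert_score(src, short_cand)
print(p, r, f1)  # high precision, low recall
```

The example illustrates the point made above: a too-short candidate achieves perfect precision but poor recall, which is why F1 rather than either component alone is used.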

Grammar Error Correction of Source
We have discussed how to address adequacy and fluency in the preceding sections. It is worth noting that adequacy is calculated using BERTScore between the source sentence and the candidate translation. However, due to the presence of noise on the source side, this approach might penalize the NMT model when it attempts to generate robust candidate translations. For example, consider the source sentence What is the defference between the two, where the spelling of difference is incorrect. If BERTScore is applied directly between the source and candidate, it will assign a lower similarity score if the NMT model produces a word corresponding to difference instead of defference on the target side. Therefore, we use a publicly available Grammar Error Correction (GEC) model named Gramformer. For an input sentence x, we obtain a grammatically corrected version x′ using the Gramformer model. Nevertheless, it is important to note that the GEC model may occasionally make incorrect edits to the source sentence, resulting in a version that is worse than the noisy source itself. To ensure that we do not penalize the model for handling noise, we calculate the final similarity score as max(BERTScore(x, ŷ), BERTScore(x′, ŷ)). Essentially, this approach considers the version with the higher similarity score as the more grammatically correct one.
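The noise-tolerant similarity described above can be sketched with stand-ins: a stub `gec` in place of Gramformer and token overlap as a crude proxy for BERTScore (both are illustrative assumptions, not the models used in the paper).

```python
def noise_tolerant_similarity(src, candidate, gec, score):
    # Take the better of score(x, y-hat) and score(GEC(x), y-hat), so the
    # model is not penalised for silently correcting source-side noise.
    return max(score(src, candidate), score(gec(src), candidate))

# Stubs: a GEC model that fixes one known typo, and token overlap as a
# crude stand-in for BERTScore.
gec = lambda s: s.replace("defference", "difference")
overlap = lambda a, b: len(set(a.split()) & set(b.split())) / len(set(b.split()))

src = "what is the defference between the two"
cand = "what is the difference between the two"  # robust "translation"
print(noise_tolerant_similarity(src, cand, gec, overlap))  # 1.0
```

Without the max over the corrected source, the robust candidate would score only 5/6 under this toy overlap measure, i.e., it would be penalized for fixing the typo.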

Proposed Model
We use a combination of the MLM score (subsection 4.1), BERTScore (subsection 4.2), and a GEC model (subsection 4.3) to train our model. First, we train an NMT model as a baseline (subsection 5.2) on general-domain datasets. We start by feeding the source sentence x into the NMT model to generate K candidate translations. To indicate that these sentences are questions, we append a question mark. However, it is important to note that this approach does not always work as intended; at times, the GEC model may interpret the presence of a question mark as a grammatical error and remove it from x. Subsequently, we calculate the similarity score between the source and each candidate using BERTScore. We repeat the same process with the edited source. For each candidate, we compute L_BERT, which represents the maximum of the two BERTScores subtracted from one, given that one is the maximum possible value of BERTScore. Subsequently, we evaluate the fluency of the candidate translation using the MLM. Finally, based on these scores, we calculate the loss and update the model parameters. It is important to emphasize that these scores serve the purpose of teaching the model to handle noise while maintaining fluency and adequacy. They are not required during testing. As a result, the architecture of the model remains unchanged. The final loss function is presented in Equation 7, in which the log-probability of each generated sequence is weighted by F(x^{(s)}, ŷ^{(s)}):

L = − ∑_{s=1}^{S} F(x^{(s)}, ŷ^{(s)}) ∑_{n=1}^{N^{(s)}} log P(ŷ_n^{(s)} | ŷ_{<n}^{(s)}, x^{(s)}; θ)    (7)

Here, S represents the number of sentences in a batch, and N^{(s)} denotes the number of tokens in the generated candidate. Due to the exponential search space of Y(x^{(s)}), we employ a sampling approach with K = 5 candidates per training sentence, where ŷ ∈ Y(x^{(s)}). Beam search is utilized to prevent the duplication of candidates.
F(x^{(s)}, ŷ^{(s)}) is determined as a weighted average between the BERTScore loss and the MLM score loss:

F(x^{(s)}, ŷ^{(s)}) = α · L_BERT + β · L_MLM

Notably, the MLM score tends to exhibit higher variance in comparison to BERTScore. To optimize the model's performance, we experimented with different sets of weights for α and β and found that α = 0.15 and β = 0.85 yielded the best results. From Equation 3, we obtain L_MLM = −log P_mlm(ŷ).
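Putting the pieces together, the weighted reward and the loss of Equation 7 can be sketched as below; `log_prob`, `l_bert`, and `l_mlm` are toy stand-ins for the NMT model, the BERTScore loss, and the MLM loss, and the numbers are illustrative only.

```python
ALPHA, BETA = 0.15, 0.85  # weights reported above

def combined_loss(candidates, log_prob, l_bert, l_mlm):
    # For each sampled candidate, weight its sequence negative
    # log-likelihood by F = alpha * L_BERT + beta * L_MLM and sum
    # over the K candidates.
    total = 0.0
    for c in candidates:
        f = ALPHA * l_bert(c) + BETA * l_mlm(c)
        total += f * (-log_prob(c))
    return total

# Toy values: two candidates with identical scores.
loss = combined_loss(
    ["cand1", "cand2"],
    log_prob=lambda c: -1.0,  # log P(y-hat | x)
    l_bert=lambda c: 0.2,     # 1 - BERTScore
    l_mlm=lambda c: 0.5,      # -log P_mlm(y-hat)
)
print(loss)  # ≈ 2 * (0.15*0.2 + 0.85*0.5) * 1.0 = 0.91
```

Since both component losses are non-negative and bounded below by zero, candidates that are simultaneously adequate and fluent contribute the smallest weighted terms.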

Experiments and Results
In this section, we will provide an overview of the datasets used, discuss the baseline models, and present the results achieved by our model.

Dataset and Annotation
For pre-training our NMT model, we utilize the Samanantar corpus, which comprises over 10 million sentence pairs for English-Hindi in the general domain. During the fine-tuning process, we focus on the first 50,000 questions from the Flipkart QnA corpus (Gain et al., 2022), using only the English side of the data. In our evaluation, we employ both sides of the test set, consisting of 500 questions. It is important to note that we manually edited some of the references in the test set to enhance fluency, maintain consistency, and ensure user-friendliness. These edits took into account product names, types, and other relevant details to make the references more suitable for questions.
Additionally, we use the Mintaka dataset (Sen et al., 2022) for evaluating our methods. Although Mintaka is a multilingual question-answering dataset, we repurpose it for translation tasks in this study. Notably, during training, we deliberately excluded the Hindi and German sides of the dataset to simulate a scenario in which the target side of the training data is unavailable.

Baseline
We obtain a pre-trained model from Gain et al. (2022), which was trained on large-scale English-Hindi data, and use it as our baseline. The baseline consists of a standard Transformer architecture with six encoder and six decoder layers. The model is trained on 10.9 million general-domain English-Hindi sentence pairs obtained from Ramesh et al. (2022). The model achieves BLEU, TER, and BLEURT (Sellam et al., 2020) scores of 43.8, 39.4, and 0.7507, respectively.

[Algorithm 1 (training procedure; only fragments of the original listing survive): remove full stops from the end, and if the last character of x or any of ŷ_{0:K} is not "?", append "?"; x′ ← GEC(x); for i = 1 to K, compute the per-candidate losses; Total Loss = sum of the losses at each index; repeat the core steps for the designated number of training steps.]

Robust Baseline
Since we deal with noisy user-generated content, we implement the following for robust training. We apply three types of noise on the source side of the pre-training dataset:
• Natural Noise: we replace characters with random characters with 1% probability.
• Keyboard Noise: we replace characters with surrounding characters from the keyboard with 5% probability.
• Vowel Removal: users often do not type vowels; we drop vowels with 5% probability.
Then, we combine the clean data with the noisy data and train the model. We obtain a 44.9 BLEU score with the robust baseline, a +1.1 improvement over the non-robust baseline.
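The three noise types can be sketched as below. The keyboard-neighbour table and the single-pass combination of the three perturbations are our own simplifying assumptions; the paper does not specify either.

```python
import random
import string

VOWELS = set("aeiouAEIOU")
# Hypothetical neighbour map; the actual keyboard layout table is an
# assumption for illustration.
KEY_NEIGHBOURS = {"a": "qsz", "s": "awed", "e": "wrd", "o": "ip", "i": "uo"}

def add_noise(text, rng, p_natural=0.01, p_keyboard=0.05, p_vowel=0.05):
    # a) natural noise: random character with 1% probability
    # b) keyboard noise: neighbouring key with 5% probability
    # c) vowel removal: drop vowels with 5% probability
    out = []
    for ch in text:
        r = rng.random()
        if r < p_natural:
            out.append(rng.choice(string.ascii_lowercase))
        elif ch.lower() in KEY_NEIGHBOURS and r < p_natural + p_keyboard:
            out.append(rng.choice(KEY_NEIGHBOURS[ch.lower()]))
        elif ch in VOWELS and r < p_natural + p_keyboard + p_vowel:
            continue  # drop the vowel
        else:
            out.append(ch)
    return "".join(out)

print(add_noise("does it work in samsung a50s", random.Random(0)))
```

Seeding the generator makes the corruption reproducible, so the same noisy copy of the corpus can be regenerated across runs.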

Domain Adaptation with MLE
We generate synthetic target data by translating the in-domain datasets with the baseline model.
Similarly, we generate synthetic target data from the robust baseline. Then, we initialize the model's weights from the baseline model and fine-tune the model on the respective synthetic data. Note that we did not add noise at this stage, as the in-domain dataset is already noisy. We use label-smoothed cross entropy as the loss function and set the smoothing value to 0.1. The model achieves 45.3 BLEU, 38.0 TER, and 0.7613 BLEURT scores. It outperforms the baseline model by 1.5 BLEU and 1.4 TER points. After fine-tuning from the robust baseline, we achieve BLEU and TER scores of 45.7 and 37.8, respectively. This indicates a +0.4 BLEU improvement over the MLE method without robust pre-training.
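For reference, the label-smoothed cross entropy used here can be sketched in one common formulation (the gold token keeps 1 − ε of the target mass and the remainder is spread uniformly over the vocabulary; implementations differ slightly in how the mass is distributed).

```python
import math

def label_smoothed_nll(log_probs, gold_index, epsilon=0.1):
    # Smoothed target distribution: 1 - eps + eps/V on the gold token and
    # eps/V elsewhere; the loss is its cross entropy with the model.
    vocab = len(log_probs)
    smooth = epsilon / vocab
    loss = 0.0
    for i, lp in enumerate(log_probs):
        target = (1.0 - epsilon) + smooth if i == gold_index else smooth
        loss -= target * lp
    return loss

# Uniform model over a 4-token vocabulary: the loss equals -log(1/4)
# regardless of epsilon, since the smoothed targets sum to one.
print(label_smoothed_nll([math.log(0.25)] * 4, gold_index=0, epsilon=0.1))
```

Compared to plain cross entropy, the smoothed targets discourage the model from becoming over-confident on the (here synthetic, hence noisy) references.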

Unsupervised Domain Adaptation with Cross-Lingual Data Selection

Vu et al. (2021) proposed a generalized unsupervised domain adaptation technique (GUDA) for NMT where only monolingual data from either the source or target language is available in the new domain. A cross-lingual data selection method is introduced to select relevant in-domain sentences from a large monolingual corpus for the language without in-domain data. This involves learning an adaptive layer on top of multilingual BERT using contrastive learning to align source- and target-language representations. A domain classifier trained on the available in-domain monolingual data can then be transferred cross-lingually to select relevant data in the other language. We sample 500K sentence pairs from general-domain data and select 50K sentence pairs from the sampled dataset with cross-lingual data selection. However, the selected examples are mostly noisy and not relevant to the target domain (here, noisy questions). This can be attributed to the fact that our target-domain dataset contains noise, resulting in improper data selection. Consequently, this method deteriorated the results compared to the baselines. As training progresses, we observe a drop in validation-set results. This decline occurs because the model's performance degrades over longer training with noisy data.

Domain Adaptation with BLEU
Similar to the MLE method, we generate synthetic references from the baseline and robust baseline models. We use 1 − BLEU as the loss function and train with the MRT method. We achieve 1.7 BLEU and 0.9 TER improvements over the baseline. However, it is important to note that the improvements vary across different metrics when comparing this method to MLE. Using the robust baseline helped achieve superior performance due to improved synthetic references: we obtain 0.6 BLEU and 0.2 TER improvements, while BLEURT remained the same. Note that the improvements achieved by this method over the corresponding MLE-based methods are statistically insignificant (Koehn, 2004), which can be attributed to the noisy nature of the synthetic references.

BERTScore and MLM Loss
We report the results of our proposed method in Table 3 and Table 8. We achieve 47.2 BLEU, 35.8 TER, and 0.7646 BLEURT scores. Our method outperforms the baseline model by 3.4 BLEU points and improves TER by 3.6 points. Further, this method outperforms the MLE model trained with synthetic data by 1.9 BLEU and 2.2 TER points, and MRT with BLEU as the loss function by similar margins. Upon using the robust model as the baseline, we achieve 46.8 BLEU and 35.4 TER scores, a 1.1 BLEU and 2.4 TER improvement over the corresponding MLE method. We perform a statistical significance test between the outputs of this method and those of the corresponding MLE method with the Moses toolkit (Koehn, 2004; Koehn et al., 2007). We found that the improvements are statistically significant, with p-values of 0.002 and 0.03 for the non-robust and robust models, respectively, with respect to the corresponding MLE fine-tuned models. It is worth noting that when using the robust baseline for our proposed method, the BLEU score decreases by a small margin (−0.4 BLEU), whereas with the MLE method, robust pre-training increases the BLEU score by 0.4. This suggests that robust pre-training has a limited effect when fine-tuning is performed on noisy data.
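The significance test can be approximated with paired bootstrap resampling over sentence-level scores, roughly in the spirit of Koehn (2004); the real test recomputes corpus BLEU on each resample rather than summing per-sentence scores, so this is a simplified sketch.

```python
import random

def paired_bootstrap_p(scores_a, scores_b, samples=1000, seed=0):
    # Fraction of bootstrap resamples in which system A fails to beat
    # system B; small values indicate a significant improvement of A.
    rng = random.Random(seed)
    n = len(scores_a)
    losses = 0
    for _ in range(samples):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) <= sum(scores_b[i] for i in idx):
            losses += 1
    return losses / samples

# A system that is better on every sentence is trivially significant.
print(paired_bootstrap_p([1.0] * 50, [0.0] * 50))  # 0.0
```

The resampling is paired: the same sentence indices are drawn for both systems, so sentence difficulty is controlled for.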

Results on Mintaka dataset
To simulate a scenario where only source data is available, we refrain from using the target side of the data during training. Consequently, we compare our models with the pre-trained models. Given that the questions are typically grammatical and the question mark is present in the source, we do not need to add it explicitly. Our method leads to a notable improvement of 1.0 BLEU points for English-Hindi compared to the baseline. However, the improvement is more modest, just 0.3 BLEU points, for English-German. Both English and German are considered high-resource languages, and the baseline model is trained on a large dataset.
Therefore, the baseline model can accurately translate most of the questions, given that the sentences in the Mintaka dataset are non-noisy.This limits the potential for improvement over a strong baseline when a parallel corpus is unavailable.

Analysis
We have observed that robust pre-training significantly improves our results. Nevertheless, the degree of improvement diminishes after fine-tuning, as both robust and non-robust baselines are fine-tuned with noisy data and learn to handle noise to a similar extent. We manually inspect sentences to check how our proposed method improves performance. We provide one example in Table 4. Note that the source sentence is grammatically incorrect. First, the sentence contains is instead of it. Further, it contains a full stop instead of a question mark. The baseline and MLE models were unable to handle it. However, our method was able to generate the correct translation. Note that although our model generated correct outputs in many such instances, there exists a large number of samples where the model was unable to generate a question-like translation. We observed that, even though our model was able to increase the probability of question-like candidates, it is often still lower than that of statement-like candidates. We suggest that this is due to MLE pre-training. We would like to explore ways to avoid this in our future work.

Limitation
This method should be preferred when there is very little or no high-quality parallel corpus available for domain adaptation. In a non-noisy situation, it might be more effective to use a robust model to generate synthetic data. Note that the proposed loss function has high variance due to the presence of the MLM score, and checkpoints should be saved frequently to obtain optimal results. The loss function relies on BERTScore and MLM models, which are known to be subject to biases (Sun et al., 2022; Jentzsch and Turan, 2022; Zhang and Hashimoto, 2021) that can propagate to the NMT model. While we did not observe such instances in our limited studies, it is important to remain vigilant about potential biases. It is essential to exercise caution when applying this method in domains where a mistranslation could have severe consequences, such as medical question-answering portals. We believe that the general concept presented in this paper may have relevance for other generative tasks that require balancing different aspects of the outputs. While exploring this is beyond the scope of our current work, it is a direction we plan to investigate in the future.

Conclusion
We have developed a robust NMT system tailored for translating questions. Our focus is on addressing the unique challenges posed by noisy questions, which are often presented in the form of statements due to the limited grammatical knowledge of users.
MLE-based fine-tuning with synthetic data has several limitations, specifically when the source is noisy. We propose an MLM- and BERTScore-based training method to balance adequacy and fluency instead of using synthetic references as training data. Our method improves translations for noisy questions compared to MLE fine-tuning with synthetic data, and it also enhances translations on non-noisy data compared to the pre-trained model. We achieve up to 47.2 BLEU and 35.4 TER scores, depending on the setting. We conducted human evaluations with annotators from an e-commerce organization and observed a clear improvement in translation quality. We believe that the approach of balancing fluency and adequacy during training can be applied to other domains and languages. In the future, we plan to explore the use of Quality Estimation metrics capable of scoring both fluency and adequacy during training. Further, we would like to explore extending the method to other low-resource languages.

Ethical Declaration
We have used publicly available datasets and content from CQA portals for training purposes, ensuring compliance with copyright regulations. To our knowledge, no personal information has been utilized in our training data. It is important to note that while our procedure has potential benefits, it is not entirely foolproof and should be used with moderation; we have highlighted its limitations in our paper.

D Choice of Evaluation Metrics
We use BLEU and TER as they are the two most popular metrics, which often, though not always, correlate with human judgment. Recent metrics like COMET (Bosselut et al., 2019) and COMET-QE have shown very promising correlations with human judgment. However, the COMET metric is based on source, hypothesis, and reference, while COMET-QE is based on source and hypothesis. Since the source is noisy, deep-learning-based metrics that depend on token embeddings of the source cannot be expected to generate faithful results, as the embeddings themselves are noisy. We therefore use BLEURT as the third evaluation metric, since it depends only on the hypothesis and the reference, and the references for the test set were created manually by human annotators.
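As a minimal illustration of what BLEU measures, the stdlib-only sketch below computes a simplified single-reference sentence BLEU: clipped n-gram precisions combined via a geometric mean with a brevity penalty. The epsilon-floor smoothing is an illustrative simplification; our experiments use standard metric implementations, not this toy version.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    # Multiset of n-grams of order n.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(hypothesis, reference, max_n=4):
    # Simplified single-reference sentence BLEU:
    # modified n-gram precisions (clipped by reference counts),
    # geometric mean, and a brevity penalty for short hypotheses.
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        h, r = ngram_counts(hyp, n), ngram_counts(ref, n)
        overlap = sum(min(c, r[g]) for g, c in h.items())
        total = max(sum(h.values()), 1)
        precisions.append(max(overlap / total, 1e-9))  # epsilon smoothing
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

A perfect match scores 1.0, and any n-gram mismatch lowers the score, which is why BLEU is sensitive to surface variation, one motivation for complementing it with TER and BLEURT.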

F Using GEC during Inference
We also investigate whether the GEC model can be used during testing to achieve even better results. First, we add question marks at the end of sentences where they are absent. Then, we pass the sentence to the GEC model and, finally, pass its output to the NMT model. In Table 9, we report our results and observe mixed improvements. Note that the difference in brackets indicates the improvement when GEC is not used at testing time. Since the GEC model is not trained exclusively on questions, it tends to remove question marks from sentences, making them more like statements. We suggest that training a GEC model exclusively on questions could improve the results. However, it is difficult to train a high-quality GEC model for questions, since question datasets are much smaller than general-domain data.
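The three-stage inference pipeline described above can be sketched as follows. The `gec_model` and `nmt_model` arguments are placeholders for any string-to-string callables (e.g. wrappers around the actual models); the punctuation set checked by `restore_question_mark` is an assumption for illustration.

```python
def restore_question_mark(text):
    # Stage 1: append a question mark when the sentence has no
    # terminal punctuation (common in noisy user-written questions).
    text = text.strip()
    if not text.endswith(("?", ".", "!", "\u0964")):  # \u0964 = Devanagari danda
        text += " ?"
    return text

def translate_with_gec(text, gec_model, nmt_model):
    # Stages 2-3: grammatical error correction, then translation.
    # Both arguments are hypothetical str -> str callables.
    return nmt_model(gec_model(restore_question_mark(text)))
```

Note that a generic GEC model in stage 2 may undo stage 1 by stripping the restored question mark, which is exactly the failure mode observed in Table 9.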

G Example of Candidates with Beam Search During Testing
In Figure 2, we show the beam search trees of two of our models. The numbers in the EOS nodes indicate the log-likelihood of the path. Note that with MLE, most of the candidates read like statements. Only one question-like candidate appears in the top 5 (last branch), ranking fourth among the top five candidates. With our method, however, the question-like candidate has a much higher probability than the other sentences. The top (first branch) and fifth (third branch) candidates are question-like, and the fourth (second branch) is partially question-like.

Figure 1: An abstract flow diagram of the training process

Algorithm 1: Our Proposed Training Procedure
1: MODEL ← Pre-trained model; GEC ← A source-side grammatical error correction model
2: MLM ← A target-side/multilingual masked language model
3: index ← 0; K ← beam width for Minimum Risk Training
4: for input sentence x ∈ batch do

Figure 2: Beam search tree of "does is support on hyundai i10." on the models with beam width = 5

Table 2: Example illustrating the impact of different MLM scores when a sentence is grammatically incorrect for a question. The MLM score of the first candidate is -1.46, but it is not appropriate for a question. To address this, we removed the period "।" and added a question mark "?" to the candidate in the second row, resulting in an MLM score of -2.58. The model is thus encouraged to generate candidates that are appropriate for questions, as seen in the third row, to achieve a more favorable loss.

We calculate the recall as outlined in Equation 4. Essentially, this metric represents the sum, over the words of the source sentence, of the highest similarity score of the most similar word in the candidate translation. It is worth noting that since the vectors are pre-normalized, calculation of ||x_i|| and ||ŷ_i|| is not required in the cosine similarity formula.

Table 3: Results of our method on the Flipkart QnA corpus (En-Hi)

Table 4: An example of a translation generated by our NMT models

Table 5: Results of our method on the Mintaka dataset

…match and Words Missed are reduced with our proposed models. Minor errors, such as Bad choice of words and Acronym/abbreviation got transliterated, increased because the sentences that were producing critical errors with other models now produce minor errors with our proposed model.

Table 9: Results of our method on the questions dataset after adding question marks and then passing them to the GEC model during testing