Large-scale Machine Translation for Indian Languages in E-commerce under Low Resource Constraints

The democratization of e-commerce platforms has led an increasingly diverse Indian user base to shop online. In this work, we deploy reliable and precise large-scale Machine Translation systems for several Indian regional languages. Building such systems is a challenge because of the low-resource nature of Indian languages. We develop a structured model development pipeline as a closed feedback loop, with external manual feedback incorporated through an Active Learning component. We demonstrate strong synthetic parallel data generation capability and consistent improvements to the model over iterations. Starting with 1.2M parallel pairs for English-Hindi, we have compiled a corpus of 400M+ high-quality synthetic parallel pairs across different domains. Further, we need colloquial translations to preserve the intent and friendliness of English content in regional languages and make it easier for our users to understand. We perform robust and effective domain adaptation steps to achieve such colloquial translations. Over iterations, we show a 9.02 BLEU point improvement for the English to Hindi translation model. Beyond Hindi, we show that the overall approach and best practices extend well to other Indian languages, resulting in the deployment of our models across 7 Indian languages.


Introduction
As one of the largest e-commerce platforms, we support a very diverse user base in terms of regional languages. Product Descriptions, Catalog Attributes, and Product Reviews help customers understand and compare the various products available on the platform. For the growing user base in India with a non-English background, providing this information in regional Indian languages makes the shopping experience more informative and friendly. With only 10% of the Indian population being versed in English, vernacular support is vital for the platform and its diverse users. In this work, we develop Machine Translation systems to translate the available product data from English to regional languages to address this problem. Given the size of the Product Catalog and user base, the volume of data to be translated is in the order of hundreds of millions of entries. This poses a challenge: the Translation systems must be robust, reliable, and precise at scale.
The low-resource nature of Indian languages is another challenge for data-hungry deep networks such as the Transformer (Vaswani et al., 2017). Given a large enough parallel corpus, the Transformer model can learn inter-lingual mappings very well, even for very long sequences. These models can generate human-level precision translations for some resource-rich European languages (Popel et al., 2020). So, in principle, if we can obtain a large enough parallel corpus for Indian languages, we can solve Automatic Machine Translation for Indian languages as well.
We build a training pipeline that takes a monolingual corpus (abundantly available from public and in-house sources) and generates a high-quality synthetic parallel corpus. This is an efficient and effective approach, especially when paired with the Active Learning component over model iterations. For Hindi, starting with 1.2M parallel examples, we have compiled over 400M synthetic parallel examples across numerous model iterations.
Translation is an inherently one-to-many task, where a single text can have several correct translations. The domain gap between the e-commerce domain and the public domain (news, government sites, Wikipedia, books, etc.) is significant. To showcase this, Figure 1 shows colloquial and non-colloquial Hindi translations for a source sentence in English.
Both of these translations are correct, but as an e-commerce platform, we refrain from using non-colloquial and infrequently used words, as they reduce the appeal of the information relative to the colloquial e-commerce English domain.
The style of translation a model generates at inference is largely determined by its final training steps. To obtain more colloquial translations, we fine-tune the model only on the colloquial in-domain data, with robust domain adaptation steps.
Our contributions in this paper are as follows:

• Synthetic Parallel Corpus Generation: With the help of sub-modules, we generate a vast, high-quality parallel corpus, addressing the low-resource nature of Indian languages.

• Iterative Model Training Pipeline: With the help of data cleaning and filtering modules, we show how Active Learning steps iteratively and significantly improve the Translation models.

• Large-Scale, High-Precision, Colloquial Models: Finally, we provide large-scale Machine Translation models with high precision and domain-adapted colloquial style for several Indian languages.

Related Work
Transformers (Vaswani et al., 2017) are a widely used architecture for seq2seq tasks. Along with unigram-based subword tokens, the fully attention-based model performs very well on translation tasks, even for longer sequences. Translation is a well-explored area, and significant work has been done even for low-resource settings. Along the lines of data gathering, collecting parallel corpora (Ramesh et al., 2021), mining multilingual sets and retrieving parallel entries (Tran et al., 2020), and iterative cross-lingual alignments (Philip et al., 2021) have been explored. Zhang et al. (2020) showed parallel corpus filtering on web-crawled data.
Transfer Learning is also a convenient approach to improve final model performance in low-resource settings. Rothe et al. (2020) explored leveraging large language models trained on unlabelled data for translation tasks. This approach works well only with strong pre-trained models, which is typically not available for Indian languages. Also, synthetic data generation is very inefficient without active learning. Imankulova et al. (2019) show that translation models can help with pseudo-labeling, but the improvement saturates without external feedback. Peris and Casacuberta (2018) explored an active learning framework for machine translation, and Gupta et al. (2021) investigate active learning methods for Machine Translation in Indian-language settings. Lample et al. (2017) even show that completely unsupervised Machine Translation is possible using just monolingual data. However, these practices do not carry over to large-scale settings: given a large amount of good-quality parallel data, supervised methods still beat weaker alternatives. Especially for production settings, there has not been much exploration of large-scale systems that start from low-resource settings.

Overall Pipeline
We use a Transformer encoder-decoder model with 6 encoder layers, 6 decoder layers, and a hidden size of 512. We use 32,000 unigram subword tokens trained on data from all domains. This configuration has 93M parameters. As a pre-processing step, we split long paragraphs into sentences and translate the sentences independently using the Transformer model.
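As an illustration of the pre-processing step described above, the sketch below splits paragraphs into sentences before translation. The regex-based splitter and the configuration field names are our own simplifications, not the production implementation.

```python
import re

# Illustrative sketch (not the production code): long paragraphs are split
# into sentences, which are then translated independently. A deployed system
# would use a proper sentence segmenter; this regex split is only a stand-in.
def split_into_sentences(paragraph: str) -> list[str]:
    # Split on sentence-final punctuation followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]

# Model configuration as described in the text (field names are assumed).
MODEL_CONFIG = {
    "encoder_layers": 6,
    "decoder_layers": 6,
    "hidden_size": 512,
    "subword_vocab_size": 32_000,  # unigram subword tokens
}
```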

Datasets Used
We use all publicly available parallel corpora from various domains, within commercial licensing restrictions. We also run internal operations to create parallel corpora for in-domain sampled datasets from Product Descriptions, Catalog Attributes, Search, and Product Reviews. This operation is costly and is only done within the Active Learning step. Apart from the parallel corpus, our pipeline relies heavily on synthetic data generation, for which we use a publicly available general-domain monolingual corpus compiled from various sources (Wenzek et al., 2020; Abadji et al., 2022; Barrault et al., 2019).

Monolingual Data Processing
To generate a synthetic parallel corpus, we use the well-known Back-Translation method (Sennrich et al., 2016) to translate the Indic monolingual corpus back to English. The corpus we use from the public domain is already curated and cleaned, so cleaning the Indic monolingual data only requires basic text normalization, rare-character filtering, punctuation fixes, etc. Apart from back-translations, we also use Forward Translations, where we translate monolingual source text to the target language with an imperfect translation model. Forward Translations are a crucial part of our training pipeline, but the quality of the synthetic pairs heavily impacts the final model training. Domains such as English Reviews and Search queries can be very noisy, with spelling, punctuation, and case errors. When generating translations for these noisy entries, the quality of the translations is limited by the noisy input itself. Hence, before using monolingual data for synthetic parallel data generation, we filter out unclean English texts from the corpus using the pipeline shown in Fig. 2. We use a BERT-based classifier model to detect noisy texts in the monolingual corpus. To improve the Translation model's robustness, we (1) correct some of the noisy data filtered out of the monolingual corpus, so that we obtain translations even for noisy text inputs, and (2) introduce noise into already clean input texts. We use in-house Transformer-based encoder-decoder Spell Correction models to correct unclean texts for search queries and reviews. As the spell-correction models have low precision (benchmarks detailed in Table 2), we again filter out unclean data from the spell-corrected set, as shown in Fig. 2.
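A minimal sketch of the noise-introduction step (2) above. The specific noise operations (character drops, case flips, duplications) and their rates are our assumptions; the paper does not spell out its exact noise model.

```python
import random

# Hypothetical noise model for robustness training: with total probability p
# per character, drop it, flip its case, or duplicate it. Deterministic via
# a fixed seed so generated training data is reproducible.
def add_noise(text: str, p: float = 0.1, seed: int = 0) -> str:
    rng = random.Random(seed)
    out = []
    for c in text:
        r = rng.random()
        if r < p / 3:
            continue                  # drop the character
        elif r < 2 * p / 3:
            out.append(c.swapcase())  # case error
        elif r < p:
            out.append(c + c)         # duplicated character (typo-like)
        else:
            out.append(c)             # keep unchanged
    return "".join(out)
```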

Translation Quality Estimation
We monitor and filter imperfect parallel pairs with two methods:

• Translation model Uncertainty score: The Transformer model uses predictions from a softmax layer to generate each output token. The output of this layer is a probability distribution over the vocabulary for each token. When the probability of the predicted token is low, the model is more uncertain about that prediction, and vice versa. We aggregate this metric over the entire output sequence and normalize by the output length to get a final uncertainty score for a translation. As the model's uncertainty score can still be biased toward erroneously predicted tokens, independent translation scoring is a good supplement for data filtering. We use an ensemble of the two translation quality estimation methods and reject translations with a high rejection recall. The evaluation scores for both models and the ensemble are detailed in Table 3. The final filtered data counts for Hindi are detailed in Table 4. As expected, the rejection rate of synthetic translations is very high for the Search and Reviews sets, as these streams have very noisy inputs.
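The uncertainty score described above can be sketched as the length-normalized negative log-probability of the predicted tokens; the exact aggregation used in production is an assumption here.

```python
import math

# Length-normalized uncertainty: average negative log-probability of the
# predicted tokens over the output sequence. Higher values mean the model
# is less certain about the translation.
def uncertainty_score(token_probs: list[float]) -> float:
    if not token_probs:
        return 0.0
    return -sum(math.log(p) for p in token_probs) / len(token_probs)
```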

Pipeline with Active Learning
To generate the synthetic parallel corpus and train Translation models on it, we use the pipeline shown in Figure 3 in an iterative manner. The detailed algorithm is given in Algo. 1. Synthetic data rejected by the Translation Quality Estimation module is pooled, and a diverse batch is sampled from this pool for correction by manual annotators. This batched Active Learning is crucial in each iteration and makes the forward translations feasible. When re-training the model in the next iteration, we thus use the filtered, high-quality synthetic translations generated by the model together with the manual corrections, instead of the imperfect translations the model originally produced. This is an overall translation corpus quality update; hence we train improved Translation models in each iteration.
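The paper does not name its diversity strategy for batch sampling; one plausible instantiation is greedy farthest-point selection over sentence embeddings, sketched below (the embedding representation and distance function are assumptions).

```python
# Greedy farthest-point selection: repeatedly pick the pooled item whose
# minimum distance to the already-selected items is largest, so the batch
# sent to annotators covers diverse regions of the embedding space.
def diverse_batch(embeddings: list[list[float]], k: int) -> list[int]:
    def dist(a: list[float], b: list[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    selected = [0]  # seed with the first pooled item (arbitrary choice)
    while len(selected) < min(k, len(embeddings)):
        best = max(
            (i for i in range(len(embeddings)) if i not in selected),
            key=lambda i: min(dist(embeddings[i], embeddings[j]) for j in selected),
        )
        selected.append(best)
    return selected
```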

Domain Adaptation
As we need colloquial translations in the output, we fine-tune the models pre-trained on the all-domain corpus using just the in-domain colloquial dataset. As evident from Table 5, BLEU scores jump sharply when the model is fine-tuned on the small in-domain training set. This shows that the domain gap between the general domain and the e-commerce colloquial domain is significant. In-domain Forward Translations (forward-translated in-domain monolingual corpus) are crucial in this step, as the cleaned and filtered high-quality forward translations help bridge the domain gap and provide much more capable pre-trained models. This ensures that the model does not suffer over-fitting or catastrophic forgetting (for the in-domain set), and we get a more robust and reliable model at scale. Table 1 has some examples where our model produces more colloquial translations and refrains from using non-colloquial and non-friendly Hindi words.

Model Iterations
As evident from Table 4, the BLEU scores improve drastically with each synthetic data addition step. The best model improves by +9.01 BLEU over the v1.1 model, which does not use any synthetic corpus.

Results and Discussion
We benchmark our models on a manually annotated Product Descriptions (PD) test set along with the public Indic WAT21 benchmark (Nakazawa et al., 2021) in Table 6. We consistently show better BLEU scores than the public translation API (Google) on all test sets. We define Translation Accuracy as the rate at which a translation is acceptable with only minor errors (the percentage excluding bad cases); it is very high across all languages. This allows us to deploy the Translation systems in large-scale, high-precision settings. Table 7 shows the exact figures from manual evaluation of English to Hindi catalog translations. Our models show remarkably few bad translation cases and a very high (>50%) share of gold-standard translations. The large domain gap between the e-commerce and general domains leads to poor evaluation results for Google, as it produces consistently non-colloquial words and is not adapted to the domain.
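The Translation Accuracy metric defined above reduces to a simple proportion; a minimal sketch, where the rating labels are our own placeholders rather than the annotation scheme used in the evaluation:

```python
# Translation Accuracy: share of translations rated acceptable, i.e. all
# ratings except "bad". The label names here are illustrative placeholders.
def translation_accuracy(ratings: list[str]) -> float:
    acceptable = sum(1 for r in ratings if r != "bad")
    return acceptable / len(ratings)
```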
Paired with Active Learning, the addition of more monolingual data, and the filtering of high-quality synthetic parallel translations, the training pipeline has consistently produced better translation models throughout the model iterations. As evident from Figure 4, adding more synthetic data in pre-training, adding forward translations for pre-training as well as domain adaptation, re-training the model from scratch with a higher-quality corpus and better pipeline sub-modules, and the active learning steps each show very significant improvements. Starting from a 32.3 BLEU score, we have reached 41.32 BLEU, which is a massive improvement using just a few active learning steps and synthetic corpus updates.

Deployment and Business Impact
Currently, the Translation models for all languages are deployed in batch-prediction mode on a CPU inference system. When translating the catalog data or updating the translation models, we trigger the deployment pipeline and update the offline batch predictions in the database.
The primary metrics used to determine the impact of this deployment are conversion and cost savings. We have seen a +11 bps improvement in conversion and significant cost savings through 100% automated translations via our system across various languages.

Conclusion
In this work, we have shown that synthetic parallel corpus generation and data filtering are a viable option for training large-scale translation models in low-resource settings. We also show that Active Learning can consistently improve the model. We build very robust, large-scale models which work very precisely on our in-domain data and also consistently outperform Google on public general-domain benchmarks. We further show how building colloquial models is important for ease of understanding, and that our overall approach and best practices extend well to multiple Indian languages.

Limitations
The proposed training pipeline relies heavily on synthetic translations. In some cases (for example, Assamese has <1M monolingual texts), there is not enough data, and the initial model itself cannot be trained appropriately, which makes the entire pipeline ineffective. Data efficiency is a considerable challenge in low-resource settings.
The pipeline uses several Language-Model-based sub-modules for data cleaning, translation quality estimation, etc., which also bound the pipeline's capability, and managing and updating many such modules can become cumbersome.

Figure 1 :
Figure 1: Both the colloquial and non-colloquial translations are correct, but for an e-commerce platform we need the more colloquial translation style.

• Independent BERT-based Quality Estimation: Given a source and target sequence, we train a multilingual BERT (Devlin et al., 2018) based classifier, which predicts whether the sequences form a perfect parallel pair. The classifier is trained on a set of correct translation pairs (pooled from available high-quality manual translation pairs) and noise-induced pairs derived from the correct pairs with varying levels of translation errors. To get the final translation score, we pass both the source-target and target-source orderings of the pair through a pre-trained BERT encoder B and feed the concatenated contexts to the classification head h:

quality_score(x_{1..T}, y_{1..T'}) = h([B(x_{1..T}, y_{1..T'}); B(y_{1..T'}, x_{1..T})])   (1)
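The bidirectional scoring in Eq. (1) can be sketched as follows, where `encode` and `head` stand in for the multilingual BERT encoder B and the trained classification head h (neither is reproduced here):

```python
# Score a candidate pair by encoding both orderings and feeding the
# concatenated context vectors to the classification head, mirroring
# Eq. (1). `encode` and `head` are injected stand-ins for B and h.
def quality_score(src: str, tgt: str, encode, head) -> float:
    fwd = encode(src, tgt)   # B(x, y): source→target context vector
    bwd = encode(tgt, src)   # B(y, x): target→source context vector
    return head(fwd + bwd)   # head over the concatenated contexts
```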

Figure 3 :
Figure 3: Model Training and Synthetic Data Generation pipeline

Figure 4 :
Figure 4: Snapshot of selected models. Iterations vs. Product Description BLEU scores for English-Hindi translation.
The details of the datasets are listed in Table 1.
Table 1: Monolingual datasets used in Synthetic Data Generation, along with the parallel corpus. We add pairs <noisy text, translation from cleaned corrected text> as translation pairs in the generated training data.

Table 2 :
Monolingual Data Cleaning. (Spell Correction rate is the percentage of unclean text the model corrects properly.)

Figure 2: Monolingual Data Cleaning and Spell Correction pipeline.

Table 4 :
Monolingual datasets used, back-translated or forward-translated dataset sizes, and filtered synthetic corpus sizes.

Table 5 :
Hindi Product Description BLEU scores.

Table 6 :
BLEU scores compared against the best public API, and Manual Translation Accuracy for our Product Descriptions (PD) and Catalog Attributes.

Table 7 :
English to Hindi translation evaluation for Product Descriptions (PD).