2024
pdf
bib
abs
An Empirical Study of In-context Learning in LLMs for Machine Translation
Pranjal Chitale
|
Jay Gala
|
Raj Dabre
Findings of the Association for Computational Linguistics: ACL 2024
Recent interest has surged in employing Large Language Models (LLMs) for machine translation (MT) via in-context learning (ICL) (Vilar et al., 2023). Most prior studies primarily focus on optimizing translation quality, with limited attention to understanding the specific aspects of ICL that influence the said quality. To this end, we perform the first of its kind, exhaustive study of in-context learning for machine translation (MT). We first establish that ICL is primarily example-driven and not instruction-driven. Following this, we conduct an extensive exploration of various aspects of the examples to understand their influence on downstream performance. Our analysis includes factors such as quality and quantity of demonstrations, spatial proximity, and source versus target originality. Further, we also investigate challenging scenarios involving indirectness and misalignment of examples to understand the limits of ICL. While we establish the significance of the quality of the target distribution over the source distribution of demonstrations, we further observe that perturbations sometimes act as regularizers, resulting in performance improvements. Surprisingly, ICL does not necessitate examples from the same task, and a related task with the same target distribution proves sufficient. We hope that our study acts as a guiding resource for considerations in utilizing ICL for MT. Our code is available on https://github.com/PranjalChitale/in-context-mt-analysis.
pdf
bib
abs
Critical Learning Periods: Leveraging Early Training Dynamics for Efficient Data Pruning
Everlyn Chimoto
|
Jay Gala
|
Orevaoghene Ahia
|
Julia Kreutzer
|
Bruce Bassett
|
Sara Hooker
Findings of the Association for Computational Linguistics: ACL 2024
Neural Machine Translation models are extremely data and compute-hungry. However, not all datapoints contribute equally to model training and generalization. Data pruning to remove the low-value data points has the benefit of drastically reducing the compute budget without significantdrop in model performance. In this paper, we propose a new data pruning technique: CheckpointsAcross Time (CAT ), that leverages early model training dynamics to identify the most relevantdata points for model performance. We benchmark CAT against several data pruning techniquesincluding COMET-QE, LASER and LaBSE. We find that CAT outperforms the benchmarks onIndo-European languages on multiple test sets. When applied to English-German, English-Frenchand English-Swahili translation tasks, CAT achieves comparable performance to using the fulldataset, while pruning up to 50% of training data. We inspect the data points that CAT selectsand find that it tends to favour longer sentences and sentences with unique or rare words.
pdf
bib
abs
RomanSetu: Efficiently unlocking multilingual capabilities of Large Language Models via Romanization
Jaavid J
|
Raj Dabre
|
Aswanth M
|
Jay Gala
|
Thanmay Jayakumar
|
Ratish Puduppully
|
Anoop Kunchukuttan
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
This study addresses the challenge of extending Large Language Models (LLMs) to non-English languages, specifically those using non-Roman scripts. We propose an approach that utilizes the romanized form of text as an interface for LLMs, hypothesizing that its frequent informal use and shared tokens with English enhance cross-lingual alignment. Our approach involve the continual pretraining of a English LLM like Llama 2 on romanized text of non-English, non-Roman script languages, followed by instruction tuning on romanized data. The results indicate that romanized text not only reduces token fertility by 2x-4x but also matches if not outperforms native script representation across various NLU, NLG and MT tasks. Moreover, the embeddings computed on romanized text exhibit closer alignment with their English translations than those from the native script. Our approach presents a promising direction for leveraging the power of English LLMs in languages traditionally underrepresented in NLP research.
2023
pdf
bib
abs
A Federated Approach for Hate Speech Detection
Jay Gala
|
Deep Gandhi
|
Jash Mehta
|
Zeerak Talat
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics
Hate speech detection has been the subject of high research attention, due to the scale of content created on social media. In spite of the attention and the sensitive nature of the task, privacy preservation in hate speech detection has remained under-studied. The majority of research has focused on centralised machine learning infrastructures which risk leaking data. In this paper, we show that using federated machine learning can help address privacy the concerns that are inherent to hate speech detection while obtaining up to 6.81% improvement in terms of F1-score.
pdf
bib
Developing State-Of-The-Art Massively Multilingual Machine Translation Systems for Related Languages
Jay Gala
|
Pranjal A. Chitale
|
Raj Dabre
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: Tutorial Abstract
pdf
bib
abs
NICT-AI4B’s Submission to the Indic MT Shared Task in WMT 2023
Raj Dabre
|
Jay Gala
|
Pranjal A. Chitale
Proceedings of the Eighth Conference on Machine Translation
In this paper, we (Team NICT-AI4B) describe our MT systems that we submit to the Indic MT task in WMT 2023. Our primary system consists of 3 stages: Joint denoising and MT training using officially approved monolingual and parallel corpora, backtranslation and, MT training on original and backtranslated parallel corpora. We observe that backtranslation leads to substantial improvements in translation quality up to 4 BLEU points. We also develop 2 contrastive systems on unconstrained settings, where the first system involves fine-tuning of IndicTrans2 DA models on official parallel corpora and seed data used in AI4Bharat et al, (2023), and the second system involves a system combination of the primary and the aforementioned system. Overall, we manage to obtain high-quality translation systems for the 4 low-resource North-East Indian languages of focus.