Meet Doshi
2024
Pretraining Language Models Using Translationese
Meet Doshi | Raj Dabre | Pushpak Bhattacharyya
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
In this paper, we explore the utility of Translationese as synthetic data created using machine translation for pre-training language models (LMs) for low-resource languages (LRLs). Our simple methodology consists of translating large amounts of web-crawled monolingual documents (clean) into the LRLs, followed by filtering the translated documents using tiny LMs trained on small but clean LRL data. Taking the case of Indian languages, we pre-train LMs from scratch with 28M and 85M parameters, and then fine-tune them for 5 downstream natural language understanding (NLU) and 4 generative (NLG) tasks. We observe that pre-training on filtered synthetic data leads to relative performance drops of only 0.87% for NLU and 2.35% for NLG, compared to pre-training on clean data, and this gap further diminishes upon the inclusion of a small amount of clean data. We also study the impact of synthetic data filtering and the choice of source language for synthetic data generation. Furthermore, evaluating continually pre-trained larger models like Gemma-2B and Llama-3-8B in few-shot settings, we observe that using synthetic data is competitive with using clean data. Our findings suggest that synthetic data shows promise for bridging the pre-training gap between English and LRLs.
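The filtering step described in this abstract can be illustrated with a minimal sketch. It assumes the filter scores each translated document by perplexity under a language model trained on clean data and keeps documents below a threshold; the abstract does not state the scoring criterion, so this is an assumption. A smoothed unigram model stands in for the tiny LM, and the function names and the `max_ppl` threshold are hypothetical.

```python
import math
from collections import Counter


def train_unigram_lm(clean_docs):
    """Count whitespace tokens in a small clean corpus.

    Assumption: a unigram count table stands in for the tiny LM.
    """
    return Counter(tok for doc in clean_docs for tok in doc.split())


def perplexity(doc, counts):
    """Per-token perplexity under an add-one-smoothed unigram model."""
    tokens = doc.split()
    if not tokens:
        return float("inf")  # empty documents are never kept
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 reserves probability mass for unseen tokens
    log_prob = sum(math.log((counts[t] + 1) / (total + vocab)) for t in tokens)
    return math.exp(-log_prob / len(tokens))


def filter_docs(docs, counts, max_ppl):
    """Keep documents whose perplexity does not exceed max_ppl (hypothetical threshold)."""
    return [d for d in docs if perplexity(d, counts) <= max_ppl]


clean = ["the cat sat on the mat", "the dog sat on the rug"]
synthetic = ["the cat sat on the rug", "zzz qqq xxx yyy"]

lm = train_unigram_lm(clean)
kept = filter_docs(synthetic, lm, max_ppl=10.0)
```

Here the second synthetic document consists only of unseen tokens, so it scores a much higher perplexity than the first and is dropped. A real setup would replace the unigram table with a trained neural LM and tune the threshold on held-out clean data.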
PUB: A Pragmatics Understanding Benchmark for Assessing LLMs’ Pragmatics Capabilities
Settaluri Sravanthi | Meet Doshi | Pavan Tankala | Rudra Murthy | Raj Dabre | Pushpak Bhattacharyya
Findings of the Association for Computational Linguistics: ACL 2024
LLMs have demonstrated remarkable capability for understanding semantics, but their understanding of pragmatics is not well studied. To this end, we release a Pragmatics Understanding Benchmark (PUB) dataset consisting of fourteen tasks across four pragmatics phenomena, namely Implicature, Presupposition, Reference, and Deixis. We curate high-quality test sets for each task, consisting of Multiple Choice Question Answers (MCQA). PUB includes a total of 28k data points, of which 6.1k are newly annotated. We evaluate nine models varying in the number of parameters and type of training. Our study reveals several key observations about the pragmatic capabilities of LLMs: (1) chat fine-tuning strongly benefits smaller models, (2) large base models are competitive with their chat-fine-tuned counterparts, (3) performance varies widely across pragmatics phenomena, and (4) there is a noticeable gap between human and model capabilities. We hope that PUB will enable comprehensive evaluation of LLMs' pragmatic reasoning capabilities.
2023
Machine Translation Advancements for Low-Resource Indian Languages in WMT23: CFILT-IITB’s Effort for Bridging the Gap
Pranav Gaikwad | Meet Doshi | Sourabh Deoghare | Pushpak Bhattacharyya
Proceedings of the Eighth Conference on Machine Translation
This paper describes the MT systems submitted by the CFILT-IITB team to the WMT23 IndicMT shared task. The task focused on MT system development from/to English and four low-resource North-East Indian languages, viz., Assamese, Khasi, Manipuri, and Mizo. We first trained systems on a small parallel corpus, which resulted in poor-quality systems. Therefore, we utilize transfer learning with the help of a large pre-trained multilingual NMT system. Since this approach produced the best results, we submitted the NMT models built with it to the shared task.