Ajay Patel


2023

Multilingual Bidirectional Unsupervised Translation through Multilingual Finetuning and Back-Translation
Bryan Li | Mohammad Sadegh Rasooli | Ajay Patel | Chris Callison-Burch
Proceedings of the Sixth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2023)

We propose a two-stage approach for training a single NMT model to translate unseen languages both to and from English. For the first stage, we initialize an encoder-decoder model with pretrained XLM-R and RoBERTa weights, then perform multilingual fine-tuning on parallel data in 40 languages to English. We find this model can generalize to zero-shot translations on unseen languages. For the second stage, we leverage this generalization ability to generate synthetic parallel data from monolingual datasets, then bidirectionally train with successive rounds of back-translation. Our approach, which we call EcXTra (English-centric Crosslingual (X) Transfer), is conceptually simple, only using a standard cross-entropy objective throughout. It is also data-driven, sequentially leveraging auxiliary parallel data and monolingual data. We evaluate unsupervised NMT results for 7 low-resource languages, and find that each round of back-translation training further refines bidirectional performance. Our final single EcXTra-trained model achieves competitive translation performance in all translation directions, notably establishing a new state-of-the-art for English-to-Kazakh (22.9 vs. 10.4 BLEU). Our code is available at [this URL](https://github.com/manestay/EcXTra).
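
As a rough illustration of the second stage, the sketch below shows one round of bidirectional back-translation in Python. The `model.translate` and `model.train_step` methods are hypothetical stand-ins for the actual EcXTra encoder-decoder and its training loop, not the paper's code.

```python
# Hypothetical sketch of one bidirectional back-translation round.
# `model.translate` and `model.train_step` are illustrative stand-ins
# for the real EcXTra encoder-decoder and its cross-entropy update.

def back_translation_round(model, mono_foreign, mono_english):
    """Synthesize parallel data from monolingual text, then train
    bidirectionally on (synthetic source -> real target) pairs."""
    pairs = []
    # Real foreign sentence as target, synthetic English as source:
    # these pairs train the English -> foreign direction.
    for x in mono_foreign:
        pairs.append((model.translate(x, tgt_lang="en"), x))
    # Real English sentence as target, synthetic foreign as source:
    # these pairs train the foreign -> English direction.
    for y in mono_english:
        pairs.append((model.translate(y, tgt_lang="xx"), y))
    for src, tgt in pairs:
        model.train_step(src, tgt)  # standard cross-entropy objective
    return model
```

Each successive round regenerates the synthetic data with the improved model, which is why performance keeps refining round over round.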

Learning Interpretable Style Embeddings via Prompting LLMs
Ajay Patel | Delip Rao | Ansh Kothary | Kathleen McKeown | Chris Callison-Burch
Findings of the Association for Computational Linguistics: EMNLP 2023

Style representation learning builds content-independent representations of author style in text. To date, no large dataset of texts with stylometric annotations on a wide range of style dimensions has been compiled, perhaps because the linguistic expertise to perform such annotation would be prohibitively expensive. Therefore, current style representation approaches make use of unsupervised neural methods to disentangle style from content to create style vectors. These approaches, however, result in uninterpretable representations, complicating their usage in downstream applications like authorship attribution where auditing and explainability are critical. In this work, we use prompting to perform stylometry on a large number of texts to generate a synthetic stylometry dataset. We then use this synthetic data to train human-interpretable style representations we call LISA embeddings. We release our synthetic dataset (StyleGenome) and our interpretable style embedding model (LISA) as resources.
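
A minimal sketch of the prompting step is below. The attribute list and prompt wording are invented for illustration; the paper derives its own set of stylometric questions.

```python
# Illustrative sketch: prompt an LLM for per-attribute style judgments.
# The attributes and prompt format here are assumptions, not the paper's.
STYLE_ATTRIBUTES = [
    "The author is formal.",
    "The author uses long sentences.",
    "The author uses sarcasm.",
]

def annotate_style(llm, passage):
    """Return {attribute: bool} judgments for one passage.
    `llm` is any callable mapping a prompt string to a text response."""
    labels = {}
    for attribute in STYLE_ATTRIBUTES:
        prompt = (
            f"Passage: {passage}\n\n"
            f"True or False: {attribute}\nAnswer:"
        )
        labels[attribute] = llm(prompt).strip().lower().startswith("true")
    return labels
```

Aggregating such per-attribute judgments over many passages yields a synthetic stylometry dataset, on which an interpretable embedding model (one dimension per human-readable attribute) can then be trained.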

2021

TopGuNN: Fast NLP Training Data Augmentation using Large Corpora
Rebecca Iglesias-Flores | Megha Mishra | Ajay Patel | Akanksha Malhotra | Reno Kriz | Martha Palmer | Chris Callison-Burch
Proceedings of the Second Workshop on Data Science with Human in the Loop: Language Advances

Acquiring training data for natural language processing systems can be expensive and time-consuming. Given a few training examples crafted by experts, large corpora can be mined for thousands of semantically similar examples that provide useful variability to improve model generalization. We present TopGuNN, a fast contextualized k-NN retrieval system that can efficiently index and search over contextual embeddings generated from large corpora. We demonstrate TopGuNN on a training data augmentation use case over the Gigaword corpus. Using approximate k-NN and an efficient architecture, TopGuNN performs queries over an embedding space of 4.63TB (approximately 1.5B embeddings) in less than a day.
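
The underlying contextualized k-NN idea can be sketched with off-the-shelf libraries (faiss and Hugging Face transformers), using sentence-level pooling for brevity; TopGuNN's actual architecture, which indexes word-level embeddings with approximate k-NN at terabyte scale, is considerably more involved.

```python
# Minimal contextual-embedding k-NN sketch; not TopGuNN's own indexing.
import faiss
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(sentences):
    """Mean-pooled contextual embeddings, L2-normalized so that
    inner product equals cosine similarity."""
    batch = tokenizer(sentences, padding=True, truncation=True,
                      return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)
    vecs = (hidden * mask).sum(1) / mask.sum(1)
    vecs = torch.nn.functional.normalize(vecs, dim=1)
    return vecs.numpy().astype("float32")

corpus = ["The senator proposed a new bill.",
          "Stocks fell sharply on Monday."]
index = faiss.IndexFlatIP(768)  # exact inner-product index
index.add(embed(corpus))

# Retrieve the nearest corpus sentence for a new query.
scores, ids = index.search(embed(["Lawmakers introduced legislation."]), 1)
print(corpus[ids[0][0]], scores[0][0])
```

At Gigaword scale, the exact flat index above would be replaced by an approximate index and a disk-backed storage layout, which is the engineering problem TopGuNN addresses.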

2018

Magnitude: A Fast, Efficient Universal Vector Embedding Utility Package
Ajay Patel | Alexander Sands | Chris Callison-Burch | Marianna Apidianaki
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Vector space embedding models like word2vec, GloVe, and fastText are extremely popular representations in natural language processing (NLP) applications. We present Magnitude, a fast, lightweight tool for utilizing and processing embeddings. Magnitude is an open source Python package with a compact vector storage file format that allows for efficient manipulation of huge numbers of embeddings. Magnitude performs common operations up to 60 to 6,000 times faster than Gensim, and introduces several novel features for improved robustness, such as out-of-vocabulary lookups.
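
A short usage sketch of the pymagnitude API follows; the `.magnitude` file path is a placeholder for any converted embedding file.

```python
from pymagnitude import Magnitude

# Load a pre-converted .magnitude file (placeholder path; the project
# provides converters and pre-built files for word2vec, GloVe, and
# fastText embeddings).
vectors = Magnitude("GoogleNews-vectors-negative300.magnitude")

print(len(vectors), vectors.dim)          # vocabulary size, dimensionality
print(vectors.similarity("cat", "dog"))   # cosine similarity
print(vectors.most_similar("cat", topn=3))

# Out-of-vocabulary words return a deterministic, character-aware
# approximation instead of raising an error.
print(vectors.query("uberization")[:5])
```

The compact file format is memory-mapped, so queries like these run without loading the full embedding matrix into RAM, which is where most of the speedup over naive loading comes from.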