Pulkit Madaan
2024
Evaluating Differentially Private Synthetic Data Generation in High-Stakes Domains
Krithika Ramesh
|
Nupoor Gandhi
|
Pulkit Madaan
|
Lisa Bauer
|
Charith Peris
|
Anjalie Field
Findings of the Association for Computational Linguistics: EMNLP 2024
The difficulty of anonymizing text data hinders the development and deployment of NLP in high-stakes domains that involve private data, such as healthcare and social services. Poorly anonymized sensitive data cannot be easily shared with annotators or external researchers, nor can it be used to train public models. In this work, we explore the feasibility of using synthetic data generated from differentially private language models in place of real data to facilitate the development of NLP in these domains without compromising privacy. In contrast to prior work, we generate synthetic data for real high-stakes domains, and we propose and conduct use-inspired evaluations to assess data quality. Our results show that prior simplistic evaluations have failed to highlight utility, privacy, and fairness issues in the synthetic data. Overall, our work underscores the need for further improvements to synthetic data generation for it to be a viable way to enable privacy-preserving data sharing.
2020
Multilingual Neural Machine Translation involving Indian Languages
Pulkit Madaan
|
Fatiha Sadat
Proceedings of the WILDRE5– 5th Workshop on Indian Language Data: Resources and Evaluation
Neural Machine Translations (NMT) models are capable of translating a single bilingual pair and require a new model for each new language pair. Multilingual Neural Machine Translation models are capable of translating multiple language pairs, even pairs which it hasn’t seen before in training. Availability of parallel sentences is a known problem in machine translation. Multilingual NMT model leverages information from all the languages to improve itself and performs better. We propose a data augmentation technique that further improves this model profoundly. The technique helps achieve a jump of more than 15 points in BLEU score from the multilingual NMT model. A BLEU score of 36.2 was achieved for Sindhi–English translation, which is higher than any score on the leaderboard of the LoResMT SharedTask at MT Summit 2019, which provided the data for the experiments.
Search
Fix data
Co-authors
- Lisa Bauer 1
- Anjalie Field 1
- Nupoor Gandhi 1
- Charith Peris 1
- Krithika Ramesh 1
- show all...