Training Text-to-Text Transformers with Privacy Guarantees

Natalia Ponomareva, Jasmijn Bastings, Sergei Vassilvitskii


Abstract
Recent advances in NLP often stem from large transformer-based pre-trained models, which rapidly grow in size and use ever more training data. Such models are often released to the public so that end users can fine-tune them on a task dataset. While it is common to treat pre-training data as public, it may still contain personally identifiable information (PII), such as names and phone numbers, as well as copyrighted material. Recent findings show that the capacity of these models allows them to memorize parts of the training data, and suggest differentially private (DP) training as a potential mitigation. While there is recent work on DP fine-tuning of NLP models, the effects of DP pre-training are less well understood: it is not clear how downstream performance is affected by DP pre-training, and whether DP pre-training mitigates some of the memorization concerns. We focus on T5 and show that by using recent advances in JAX and XLA we can train models with DP that do not suffer a large drop in pre-training utility, nor in training speed, and can still be fine-tuned to high accuracies on downstream tasks (e.g., GLUE). Moreover, we show that T5’s span corruption is a good defense against data memorization.
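The abstract does not spell out the DP training procedure. A common way to realize DP training is DP-SGD: clip each example's gradient to a fixed norm, sum the clipped gradients, add Gaussian noise scaled to that norm, and average. The sketch below illustrates one such update on a toy linear model in plain Python (not JAX, to stay dependency-free). The function names, the squared-error loss, and all hyperparameter values are assumptions made for illustration and are not taken from the paper.

```python
import math
import random

def clip(grad, max_norm):
    """Scale grad so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def per_example_grad(w, x, y):
    """Gradient of 0.5 * (w.x - y)^2 with respect to w for one example (toy loss)."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [err * xi for xi in x]

def dp_sgd_step(w, batch, lr, max_norm, noise_multiplier, rng):
    """One DP-SGD update: clip per-example gradients, sum, add noise, average."""
    total = [0.0] * len(w)
    for x, y in batch:
        g = clip(per_example_grad(w, x, y), max_norm)
        total = [t + gi for t, gi in zip(total, g)]
    # Noise standard deviation is tied to the clipping norm, which bounds
    # how much any single example can change the summed gradient.
    std = noise_multiplier * max_norm
    noisy = [t + rng.gauss(0.0, std) for t in total]
    return [wi - lr * n / len(batch) for wi, n in zip(w, noisy)]

# Hypothetical toy batch: the third example has a very large gradient,
# which clipping caps at max_norm before it can dominate the update.
rng = random.Random(0)
batch = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([3.0, 4.0], 100.0)]
w = dp_sgd_step([0.0, 0.0], batch, lr=0.1, max_norm=1.0,
                noise_multiplier=1.1, rng=rng)
```

This sketch covers only the update rule. Turning it into a stated privacy guarantee additionally requires a privacy accountant that tracks the cumulative privacy loss over all training steps, which is omitted here.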
Anthology ID:
2022.findings-acl.171
Volume:
Findings of the Association for Computational Linguistics: ACL 2022
Month:
May
Year:
2022
Address:
Dublin, Ireland
Editors:
Smaranda Muresan, Preslav Nakov, Aline Villavicencio
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
2182–2193
URL:
https://aclanthology.org/2022.findings-acl.171
DOI:
10.18653/v1/2022.findings-acl.171
Cite (ACL):
Natalia Ponomareva, Jasmijn Bastings, and Sergei Vassilvitskii. 2022. Training Text-to-Text Transformers with Privacy Guarantees. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2182–2193, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal):
Training Text-to-Text Transformers with Privacy Guarantees (Ponomareva et al., Findings 2022)
PDF:
https://aclanthology.org/2022.findings-acl.171.pdf