Theresa Breiner
2022
Scaling Language Model Size in Cross-Device Federated Learning
Jae Ro | Theresa Breiner | Lara McConnaughey | Mingqing Chen | Ananda Suresh | Shankar Kumar | Rajiv Mathews
Proceedings of the First Workshop on Federated Learning for Natural Language Processing (FL4NLP 2022)
Most studies in cross-device federated learning focus on small models, due to the server-client communication and on-device computation bottlenecks. In this work, we leverage various techniques for mitigating these bottlenecks to train larger language models in cross-device federated learning. With systematic applications of partial model training, quantization, efficient transfer learning, and communication-efficient optimizers, we are able to train a 21M parameter Transformer that achieves the same perplexity as that of a similarly sized LSTM with ∼10× smaller client-to-server communication cost and 11% lower perplexity than smaller LSTMs commonly studied in the literature.
2020
Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus
Isaac Caswell | Theresa Breiner | Daan van Esch | Ankur Bapna
Proceedings of the 28th International Conference on Computational Linguistics
Large text corpora are increasingly important for a wide variety of Natural Language Processing (NLP) tasks, and automatic language identification (LangID) is a core technology needed to collect such datasets in a multilingual context. LangID is largely treated as solved in the literature, with models reported that achieve over 90% average F1 on as many as 1,366 languages. We train LangID models on up to 1,629 languages with comparable quality on held-out test sets, but find that human-judged LangID accuracy for web-crawl text corpora created using these models is only around 5% for many lower-resource languages, suggesting a need for more robust evaluation. Further analysis revealed a variety of error modes, arising from domain mismatch, class imbalance, language similarity, and insufficiently expressive models. We propose two classes of techniques to mitigate these errors: wordlist-based tunable-precision filters (for which we release curated lists in about 500 languages) and transformer-based semi-supervised LangID models, which increase median dataset precision from 5.5% to 71.2%. These techniques enable us to create an initial data set covering 100K or more relatively clean sentences in each of 500+ languages, paving the way towards a 1,000-language web text corpus.