Regularising Fisher Information Improves Cross-lingual Generalisation

Asa Cooper Stickland, Iain Murray


Abstract
Many recent works use ‘consistency regularisation’ to improve the generalisation of fine-tuned pre-trained models, both multilingual and English-only. These works encourage model outputs to be similar between a perturbed and a normal version of the input, usually by penalising the Kullback–Leibler (KL) divergence between the output probability distributions of the perturbed and normal model. We believe that consistency losses may be implicitly regularising the loss landscape. In particular, we build on work hypothesising that implicitly or explicitly regularising the trace of the Fisher Information Matrix (FIM) amplifies the implicit bias of SGD to avoid memorisation. Our initial results show, both empirically and theoretically, that consistency losses are related to the FIM, and that the flat minima implied by a small trace of the FIM improve performance when fine-tuning a multilingual model on additional languages. We aim to confirm these initial results on more datasets, and to use our insights to develop better multilingual fine-tuning techniques.
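The two quantities named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a toy linear softmax classifier with weight matrix W, an arbitrary choice of KL direction (clean to perturbed), and hypothetical helper names (consistency_loss, fim_trace). The trace of the FIM is computed as the expected squared norm of the gradient of log p(y|x), with y sampled from the model's own predictive distribution.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    # KL(p || q) for two discrete distributions given as lists.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def linear_logits(W, x):
    # Toy model (an assumption): logits = W x, one row of W per class.
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def consistency_loss(W, x, x_perturbed):
    # KL between predictions on the clean and the perturbed input.
    # The direction KL(clean || perturbed) is an assumption of this sketch.
    p_clean = softmax(linear_logits(W, x))
    p_pert = softmax(linear_logits(W, x_perturbed))
    return kl_divergence(p_clean, p_pert)

def fim_trace(W, inputs):
    # Tr(F) = E_x E_{y ~ p(y|x)} || grad_W log p(y|x) ||^2.
    # For the linear softmax model, grad_W log p(y|x) = (onehot(y) - p) x^T,
    # so its squared norm is ||onehot(y) - p||^2 * ||x||^2.
    total = 0.0
    for x in inputs:
        p = softmax(linear_logits(W, x))
        x_sq = sum(v * v for v in x)
        for y, p_y in enumerate(p):
            resid_sq = sum(((1.0 if k == y else 0.0) - p_k) ** 2
                           for k, p_k in enumerate(p))
            total += p_y * resid_sq * x_sq
    return total / len(inputs)

# Hypothetical toy values, for illustration only.
W = [[2.0, 0.0], [0.0, 1.0]]
x = [1.0, 0.5]
x_noisy = [0.9, 0.6]
print("consistency loss:", consistency_loss(W, x, x_noisy))
print("trace of FIM:", fim_trace(W, [x, x_noisy]))
```

In this toy setting both quantities shrink as the predictive distribution becomes less sensitive to the parameters and inputs, which is the kind of relationship the abstract points to; the sketch does not reproduce the paper's analysis or experiments.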
Anthology ID:
2021.mrl-1.20
Volume:
Proceedings of the 1st Workshop on Multilingual Representation Learning
Month:
November
Year:
2021
Address:
Punta Cana, Dominican Republic
Venues:
EMNLP | MRL
Publisher:
Association for Computational Linguistics
Pages:
238–241
URL:
https://aclanthology.org/2021.mrl-1.20
PDF:
https://aclanthology.org/2021.mrl-1.20.pdf
Data
XNLI