Data Factors for Better Compositional Generalization

Xiang Zhou, Yichen Jiang, Mohit Bansal


Abstract
Recent diagnostic datasets on compositional generalization, such as SCAN (Lake and Baroni, 2018) and COGS (Kim and Linzen, 2020), expose severe problems in models trained from scratch on these datasets. However, in contrast to this poor performance, state-of-the-art models trained on larger and more general datasets show better generalization ability. In this work, to reconcile this inconsistency, we conduct an empirical analysis by training Transformer models on a variety of training sets with different data factors, including dataset scale, pattern complexity, example difficulty, etc. First, we show that increased dataset complexity can lead to better generalization behavior on multiple different generalization challenges. To further understand this improvement, we show two axes of the benefit from more complex datasets: they provide more diverse examples so compositional understanding becomes more effective, and they also prevent ungeneralizable memorization of the examples due to reduced example repetition frequency. Finally, we explore how training examples of different difficulty levels influence generalization differently. On synthetic datasets, simple examples invoke stronger compositionality than hard examples do. On larger-scale real language datasets, while hard examples become more important potentially to ensure decent data coverage, a balanced mixture of simple and hard examples manages to induce the strongest generalizability.
Anthology ID:
2023.emnlp-main.898
Volume:
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
14549–14566
URL:
https://aclanthology.org/2023.emnlp-main.898
DOI:
10.18653/v1/2023.emnlp-main.898
Cite (ACL):
Xiang Zhou, Yichen Jiang, and Mohit Bansal. 2023. Data Factors for Better Compositional Generalization. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 14549–14566, Singapore. Association for Computational Linguistics.
Cite (Informal):
Data Factors for Better Compositional Generalization (Zhou et al., EMNLP 2023)
PDF:
https://aclanthology.org/2023.emnlp-main.898.pdf
Video:
https://aclanthology.org/2023.emnlp-main.898.mp4