Filtered Corpus Training (FiCT) Shows that Language Models Can Generalize from Indirect Evidence

Abhinav Patil, Jaap Jumelet, Yu Ying Chiu, Andy Lapastora, Peter Shen, Lexie Wang, Clevis Willrich, Shane Steinert-Threlkeld


Abstract
This paper introduces Filtered Corpus Training (FiCT), a method that trains language models (LMs) on corpora from which specific linguistic constructions have been filtered out, and uses it to measure the ability of LMs to perform linguistic generalization on the basis of indirect evidence. We apply the method to both LSTM and Transformer LMs (of roughly comparable size), developing filtered corpora that target a wide range of linguistic phenomena. Our results show that while Transformers are better qua LMs (as measured by perplexity), both models perform equally and surprisingly well on linguistic generalization measures, suggesting that they are capable of generalizing from indirect evidence.
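To make the filtering idea concrete, here is a minimal sketch (not the authors' actual pipeline) of removing one hypothetical target construction, English passives, from a toy corpus using a crude regex heuristic; the paper's filters target a range of phenomena and are considerably more precise.

```python
import re

# Crude illustrative cue for passive voice: a form of "be" followed by a
# word ending in "-ed" or "-en". This heuristic is an assumption for the
# sketch only; it over- and under-matches real passives.
PASSIVE_CUE = re.compile(r"\b(?:was|were|been|being)\s+\w+(?:ed|en)\b", re.IGNORECASE)

def filter_corpus(sentences):
    """Return the corpus with sentences containing the target construction removed."""
    return [s for s in sentences if not PASSIVE_CUE.search(s)]

corpus = [
    "The cat chased the mouse.",
    "The mouse was chased by the cat.",  # filtered out: passive cue
    "She has eaten breakfast.",
]
print(filter_corpus(corpus))
# ['The cat chased the mouse.', 'She has eaten breakfast.']
```

An LM trained on the filtered corpus can then be probed on held-out minimal pairs involving the filtered construction: above-chance accuracy would indicate generalization from indirect evidence rather than memorization of direct exposure.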
Anthology ID: 2024.tacl-1.87
Volume: Transactions of the Association for Computational Linguistics, Volume 12
Year: 2024
Address: Cambridge, MA
Venue: TACL
Publisher: MIT Press
Pages: 1597–1615
URL: https://aclanthology.org/2024.tacl-1.87/
DOI: 10.1162/tacl_a_00720
Cite (ACL): Abhinav Patil, Jaap Jumelet, Yu Ying Chiu, Andy Lapastora, Peter Shen, Lexie Wang, Clevis Willrich, and Shane Steinert-Threlkeld. 2024. Filtered Corpus Training (FiCT) Shows that Language Models Can Generalize from Indirect Evidence. Transactions of the Association for Computational Linguistics, 12:1597–1615.
Cite (Informal): Filtered Corpus Training (FiCT) Shows that Language Models Can Generalize from Indirect Evidence (Patil et al., TACL 2024)
PDF: https://aclanthology.org/2024.tacl-1.87.pdf