Characterizing Text Datasets with Psycholinguistic Features

Marcio Monteiro, Charu Karakkaparambil James, Marius Kloft, Sophie Fellenz


Abstract
Fine-tuning pretrained language models on task-specific data is a common practice in Natural Language Processing (NLP) applications. However, the number of pretrained models available to choose from can be very large, and it remains unclear how to select the optimal model without spending considerable amounts of computational resources, especially for the text domain. To address this problem, we introduce PsyMatrix, a novel framework designed to efficiently characterize text datasets. PsyMatrix evaluates multiple dimensions of text and discourse, producing interpretable, low-dimensional embeddings. Our framework has been tested using a meta-dataset repository that includes the performance of 24 pretrained large language models fine-tuned across 146 classification datasets. Using the proposed embeddings, we successfully developed a meta-learning system capable of recommending the most effective pretrained models (optimal and near-optimal) for fine-tuning on new datasets.
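The recommendation step described in the abstract can be illustrated with a small sketch. This is not the authors' meta-learner: it assumes a plain k-nearest-neighbour lookup over dataset embeddings, and the function name `recommend_models`, the three-dimensional embeddings, the model names, and all scores are hypothetical, made-up values used only for illustration.

```python
import math


def recommend_models(new_embedding, meta_dataset, k=2, top_n=2):
    """Rank candidate models by their mean score on the k nearest known datasets.

    Hypothetical helper: a nearest-neighbour stand-in for a meta-learner,
    not the method used in the paper.
    """
    # Sort the known datasets by Euclidean distance to the new embedding
    # and keep the k closest ones.
    neighbours = sorted(
        meta_dataset,
        key=lambda entry: math.dist(new_embedding, entry["embedding"]),
    )[:k]
    # Average each model's fine-tuning score over those neighbours.
    models = neighbours[0]["scores"].keys()
    mean_scores = {
        m: sum(n["scores"][m] for n in neighbours) / len(neighbours)
        for m in models
    }
    # Return the model names with the highest mean score first.
    return sorted(mean_scores, key=mean_scores.get, reverse=True)[:top_n]


# Toy meta-dataset with invented embeddings and scores (assumption, not paper data).
META = [
    {"embedding": [0.1, 0.9, 0.3],
     "scores": {"model_a": 0.80, "model_b": 0.70, "model_c": 0.60}},
    {"embedding": [0.2, 0.8, 0.4],
     "scores": {"model_a": 0.78, "model_b": 0.74, "model_c": 0.65}},
    {"embedding": [0.9, 0.1, 0.7],
     "scores": {"model_a": 0.55, "model_b": 0.60, "model_c": 0.82}},
]

print(recommend_models([0.15, 0.85, 0.35], META))
```

In this sketch, a new dataset whose embedding lies close to the first two toy entries inherits their ranking of models, so no fine-tuning run is needed to produce a shortlist.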
Anthology ID:
2024.findings-emnlp.880
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2024
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
14977–14990
URL:
https://aclanthology.org/2024.findings-emnlp.880
Cite (ACL):
Marcio Monteiro, Charu Karakkaparambil James, Marius Kloft, and Sophie Fellenz. 2024. Characterizing Text Datasets with Psycholinguistic Features. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 14977–14990, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Characterizing Text Datasets with Psycholinguistic Features (Monteiro et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-emnlp.880.pdf