DATA-CUBE: Data Curriculum for Instruction-based Sentence Representation Learning

Yingqian Min, Kun Zhou, Dawei Gao, Xin Zhao, He Hu, Yaliang Li


Abstract
Recently, multi-task instruction tuning has been utilized to improve sentence representation learning (SRL). It enables SRL models to generate task-specific representations with the guidance of task instruction, thus exhibiting strong generalization ability on unseen tasks. However, these methods mostly neglect the potential interference problems across different tasks and instances, which may affect the training of the model. To address this issue, we propose a data curriculum method, namely **Data-CUBE**, that arranges the order of all the multi-task data for training, to minimize the interference risks from two aspects. At the task level, we aim to find the optimal task order to minimize the total cross-task interference risk and formulate this problem as the traveling salesman problem, which is further solved by a specially designed simulated annealing algorithm. At the instance level, we propose a measurement method to quantify the difficulty of all instances per task, and then arrange instances in an easy-to-difficult order for training. Experimental results show that our approach can boost the performance of state-of-the-art methods. Our code and data will be publicly released.
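The two-level curriculum described in the abstract can be illustrated with a rough Python sketch. This is not the authors' implementation: the abstract does not specify how interference risk or instance difficulty is measured, so the `risk` matrix and the `difficulty` function below are hypothetical inputs, the open-path cost (rather than a closed tour) is an assumption, and the annealing loop is a generic segment-reversal variant rather than the paper's specially designed algorithm.

```python
import math
import random


def path_cost(order, risk):
    """Total risk summed over consecutive task pairs in the given order."""
    return sum(risk[a][b] for a, b in zip(order, order[1:]))


def anneal_task_order(risk, steps=5000, t_start=1.0, t_end=1e-3, seed=0):
    """Generic simulated annealing over task permutations.

    risk[a][b] is a hypothetical interference score for training
    task b immediately after task a. Returns (best_order, best_cost).
    """
    rng = random.Random(seed)
    n = len(risk)
    order = list(range(n))
    rng.shuffle(order)
    cost = path_cost(order, risk)
    best, best_cost = order[:], cost
    if n < 2:
        return best, best_cost  # nothing to reorder

    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        temp = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        # Propose a neighbour by reversing a random segment.
        i, j = sorted(rng.sample(range(n), 2))
        cand = order[:i] + order[i:j + 1][::-1] + order[j + 1:]
        cand_cost = path_cost(cand, risk)
        delta = cand_cost - cost
        # Always accept improvements; accept worse moves with
        # probability exp(-delta / temp).
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            order, cost = cand, cand_cost
            if cost < best_cost:
                best, best_cost = order[:], cost
    return best, best_cost


def sort_easy_to_hard(instances, difficulty):
    """Order one task's instances by a hypothetical difficulty score."""
    return sorted(instances, key=difficulty)
```

In this sketch, a training loop would visit tasks in the order returned by `anneal_task_order` and, within each task, feed instances in the order returned by `sort_easy_to_hard`.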
Anthology ID:
2024.findings-acl.816
Volume:
Findings of the Association for Computational Linguistics ACL 2024
Month:
August
Year:
2024
Address:
Bangkok, Thailand and virtual meeting
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
13748–13761
URL:
https://aclanthology.org/2024.findings-acl.816
DOI:
10.18653/v1/2024.findings-acl.816
Cite (ACL):
Yingqian Min, Kun Zhou, Dawei Gao, Xin Zhao, He Hu, and Yaliang Li. 2024. DATA-CUBE: Data Curriculum for Instruction-based Sentence Representation Learning. In Findings of the Association for Computational Linguistics ACL 2024, pages 13748–13761, Bangkok, Thailand and virtual meeting. Association for Computational Linguistics.
Cite (Informal):
DATA-CUBE: Data Curriculum for Instruction-based Sentence Representation Learning (Min et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-acl.816.pdf