Jellyfish: Instruction-Tuning Local Large Language Models for Data Preprocessing

Haochen Zhang, Yuyang Dong, Chuan Xiao, Masafumi Oyamada


Abstract
This paper explores the use of LLMs for data preprocessing (DP), a crucial step in the data mining pipeline that transforms raw data into a clean format. We instruction-tune local LLMs as universal DP task solvers that operate on a single, low-priced local GPU, ensuring data security and enabling further customization. We select a collection of datasets across four representative DP tasks and construct instruction data using data configuration, knowledge injection, and reasoning data distillation techniques tailored to DP. By tuning Mistral-7B, Llama 3-8B, and OpenOrca-Platypus2-13B, our models, Jellyfish-7B/8B/13B, deliver performance competitive with the GPT-3.5/4 models and strong generalizability to unseen tasks, while barely compromising the base models' abilities on NLP tasks. Meanwhile, Jellyfish offers enhanced reasoning capabilities compared to GPT-3.5. Our models are available at https://huggingface.co/NECOUDBFM/Jellyfish, and our instruction dataset is available at https://huggingface.co/datasets/NECOUDBFM/Jellyfish-Instruct.
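As a concrete illustration of running one of the released models on a single local GPU, the sketch below loads a Jellyfish checkpoint through the standard Hugging Face transformers interface and issues an entity-matching style DP instruction. The repository name NECOUDBFM/Jellyfish-13B, the prompt wording, and the generation settings are assumptions for illustration only; the exact prompt template and repository names should be taken from the model card linked above.

```python
# Minimal sketch: prompting a Jellyfish model for an entity-matching task
# via the standard Hugging Face causal-LM interface (assumed setup).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NECOUDBFM/Jellyfish-13B"  # assumed repository name; see the model card
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# An entity-matching style instruction: decide whether two records refer to
# the same real-world product. The prompt wording here is illustrative.
prompt = (
    "You are tasked with determining whether two records refer to the same real-world entity.\n"
    'Record A: [name: "iPhone 13 Pro 128GB", brand: "Apple"]\n'
    'Record B: [name: "Apple iPhone 13 Pro (128 GB)", brand: "Apple"]\n'
    "Answer with 'Yes' or 'No'."
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=16, do_sample=False)
# Print only the newly generated tokens (the model's answer).
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```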
Anthology ID:
2024.emnlp-main.497
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
8754–8782
URL:
https://aclanthology.org/2024.emnlp-main.497
Cite (ACL):
Haochen Zhang, Yuyang Dong, Chuan Xiao, and Masafumi Oyamada. 2024. Jellyfish: Instruction-Tuning Local Large Language Models for Data Preprocessing. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 8754–8782, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Jellyfish: Instruction-Tuning Local Large Language Models for Data Preprocessing (Zhang et al., EMNLP 2024)
PDF:
https://aclanthology.org/2024.emnlp-main.497.pdf
Data:
2024.emnlp-main.497.data.zip