Data Augmentation for Cross-domain Parsing via Lightweight LLM Generation and Tree Hybridization

Ziyan Zhang, Yang Hou, Chen Gong, Zhenghua Li


Abstract
Cross-domain constituency parsing remains a challenging task due to the lack of high-quality out-of-domain data. In this paper, we propose a data augmentation method via lightweight large language model (LLM) generation and tree hybridization. We utilize an LLM to generate phrase structures (subtrees) for the target domain by incorporating grammar rules and lexical head information into the prompt. To better leverage the LLM-generated target-domain subtrees, we hybridize them with existing source-domain subtrees to efficiently produce a large number of structurally diverse instances. Experimental results demonstrate that our method achieves significant improvements on five target domains at a lightweight LLM generation cost.
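The tree-hybridization idea described in the abstract — splicing LLM-generated target-domain subtrees into existing source-domain trees — can be illustrated with a minimal sketch. This is a hypothetical toy implementation, not the authors' code: trees are represented as `(label, children)` pairs, and `hybridize` swaps one source subtree for a target-domain subtree carrying the same phrase label (e.g. NP).

```python
import copy
import random

def subtrees(tree):
    """Yield (children_list, index, subtree) for every internal node."""
    _label, children = tree
    for i, child in enumerate(children):
        if isinstance(child, tuple):  # internal node; leaves are plain strings
            yield children, i, child
            yield from subtrees(child)

def hybridize(source_tree, target_subtrees, rng=random):
    """Replace one source-domain subtree with a label-matching
    target-domain subtree, returning a new hybrid tree."""
    tree = copy.deepcopy(source_tree)  # keep the original intact
    slots = list(subtrees(tree))
    rng.shuffle(slots)
    for children, i, (label, _) in slots:
        candidates = [t for t in target_subtrees if t[0] == label]
        if candidates:
            children[i] = rng.choice(candidates)
            return tree
    return tree  # no matching label: return an unchanged copy

def leaves(tree):
    """Collect the terminal words of a tree, left to right."""
    out = []
    _label, children = tree
    for child in children:
        if isinstance(child, tuple):
            out.extend(leaves(child))
        else:
            out.append(child)
    return out

# Toy example: one source-domain tree, one (hypothetical) LLM-generated NP.
source = ("S", [("NP", [("PRP", ["We"])]),
                ("VP", [("VBP", ["parse"]),
                        ("NP", [("NNS", ["sentences"])])])])
target_np = ("NP", [("JJ", ["clinical"]), ("NNS", ["notes"])])
hybrid = hybridize(source, [target_np])
```

Repeating this swap over many source trees and many generated subtrees is what yields a large number of structurally diverse training instances; in practice one would also control which labels and spans are eligible for swapping.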
Anthology ID:
2025.coling-main.744
Volume:
Proceedings of the 31st International Conference on Computational Linguistics
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue:
COLING
Publisher:
Association for Computational Linguistics
Pages:
11235–11247
URL:
https://aclanthology.org/2025.coling-main.744/
Cite (ACL):
Ziyan Zhang, Yang Hou, Chen Gong, and Zhenghua Li. 2025. Data Augmentation for Cross-domain Parsing via Lightweight LLM Generation and Tree Hybridization. In Proceedings of the 31st International Conference on Computational Linguistics, pages 11235–11247, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
Data Augmentation for Cross-domain Parsing via Lightweight LLM Generation and Tree Hybridization (Zhang et al., COLING 2025)
PDF:
https://aclanthology.org/2025.coling-main.744.pdf