An Empirical Study of Synthetic Data Generation for Implicit Discourse Relation Recognition

Kazumasa Omura, Fei Cheng, Sadao Kurohashi


Abstract
Implicit Discourse Relation Recognition (IDRR), the task of recognizing the semantic relation between given text spans that contain no overt clues, is a long-standing and challenging problem. The paucity of training data for some error-prone discourse relations makes it even more challenging. To address this issue, we propose a method of generating synthetic data for IDRR using a large language model. The method consists of two steps: extracting confusing discourse relation pairs based on false negative rate, and synthesizing data focused on that confusion. Its key points are the use of a confusion matrix and the adoption of two-stage prompting to obtain effective synthetic data. Following this method, we generated synthetic data several times larger than the original training examples for some error-prone discourse relations and incorporated it into training. In experiments, the synthetic data yielded state-of-the-art macro-F1 performance without sacrificing micro-F1 performance, and it had positive effects especially on recognizing some infrequent discourse relations.
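The first step, extracting confusing relation pairs from a confusion matrix by false negative rate, might be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `confusing_pairs`, the 0.5 threshold, the rule of keeping only the single most frequent wrong prediction per gold label, and the label names and counts are all assumptions made for the example.

```python
def confusing_pairs(cm, labels, fnr_threshold=0.5):
    """Return (gold, predicted, rate) triples for error-prone gold labels.

    cm[i][j] counts examples whose gold label is labels[i] and whose
    predicted label is labels[j]. The threshold is a hypothetical choice.
    """
    pairs = []
    for i, gold in enumerate(labels):
        support = sum(cm[i])
        if support == 0:
            continue
        # False negative rate: share of gold-label examples predicted as another label.
        fnr = 1.0 - cm[i][i] / support
        if fnr < fnr_threshold:
            continue
        # Among the wrong predictions, pick the label this relation is most often mistaken for.
        errors = [(count, j) for j, count in enumerate(cm[i]) if j != i]
        count, j = max(errors, default=(0, i))
        if count > 0:
            pairs.append((gold, labels[j], count / support))
    return pairs


# Hypothetical labels and counts, for illustration only.
labels = ["Cause", "Concession", "Contrast"]
cm = [
    [80, 5, 15],
    [6, 4, 10],
    [10, 5, 35],
]
result = confusing_pairs(cm, labels)
print(result)
```

With these made-up counts, only "Concession" exceeds the threshold (16 of its 20 examples are misclassified), and it is most often predicted as "Contrast", so that pair would be the one targeted when synthesizing data.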
Anthology ID:
2024.lrec-main.96
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
Publisher:
ELRA and ICCL
Pages:
1073–1085
URL:
https://aclanthology.org/2024.lrec-main.96
Cite (ACL):
Kazumasa Omura, Fei Cheng, and Sadao Kurohashi. 2024. An Empirical Study of Synthetic Data Generation for Implicit Discourse Relation Recognition. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 1073–1085, Torino, Italia. ELRA and ICCL.
Cite (Informal):
An Empirical Study of Synthetic Data Generation for Implicit Discourse Relation Recognition (Omura et al., LREC-COLING 2024)
PDF:
https://aclanthology.org/2024.lrec-main.96.pdf