Improving Large-scale Paraphrase Acquisition and Generation

Yao Dou, Chao Jiang, Wei Xu


Abstract
This paper addresses the quality issues in existing Twitter-based paraphrase datasets, and discusses the necessity of using two separate definitions of paraphrase for identification and generation tasks. We present a new Multi-Topic Paraphrase in Twitter (MultiPIT) corpus that consists of a total of 130k sentence pairs with crowdsourcing (MultiPIT_crowd) and expert (MultiPIT_expert) annotations using two different paraphrase definitions for paraphrase identification, in addition to a multi-reference test set (MultiPIT_NMR) and a large automatically constructed training set (MultiPIT_Auto) for paraphrase generation. With improved data annotation quality and task-specific paraphrase definition, the best pre-trained language model fine-tuned on our dataset achieves state-of-the-art performance of 84.2 F1 for automatic paraphrase identification. Furthermore, our empirical results also demonstrate that the paraphrase generation models trained on MultiPIT_Auto generate more diverse and higher-quality paraphrases compared to their counterparts fine-tuned on other corpora such as Quora, MSCOCO, and ParaNMT.
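
The abstract reports results from fine-tuning a pre-trained language model on MultiPIT for paraphrase identification. Below is a minimal, hypothetical sketch of that setup framed as binary sentence-pair classification; the model choice (roberta-base), the toy data, the field names, and the hyperparameters are illustrative assumptions rather than the paper's exact configuration.

    # Hypothetical sketch: fine-tune a pre-trained encoder for paraphrase
    # identification as binary sentence-pair classification.
    # Model name, data format, and hyperparameters are illustrative only.
    from datasets import Dataset
    from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                              Trainer, TrainingArguments)

    # Toy stand-in for MultiPIT_crowd-style pairs: (sentence1, sentence2, label)
    pairs = Dataset.from_dict({
        "sentence1": ["the game was cancelled", "i love this song"],
        "sentence2": ["they called off the game", "the weather is nice"],
        "label": [1, 0],
    })

    tokenizer = AutoTokenizer.from_pretrained("roberta-base")
    model = AutoModelForSequenceClassification.from_pretrained(
        "roberta-base", num_labels=2)

    def encode(batch):
        # Encode each sentence pair jointly so the model attends across both sides.
        return tokenizer(batch["sentence1"], batch["sentence2"],
                         truncation=True, padding="max_length", max_length=128)

    encoded = pairs.map(encode, batched=True)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="paraphrase_clf",
                               num_train_epochs=1,
                               per_device_train_batch_size=2),
        train_dataset=encoded,
    )
    trainer.train()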
Anthology ID:
2022.emnlp-main.631
Volume:
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates
Editors:
Yoav Goldberg, Zornitsa Kozareva, Yue Zhang
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
9301–9323
URL:
https://aclanthology.org/2022.emnlp-main.631
DOI:
10.18653/v1/2022.emnlp-main.631
Cite (ACL):
Yao Dou, Chao Jiang, and Wei Xu. 2022. Improving Large-scale Paraphrase Acquisition and Generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9301–9323, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Cite (Informal):
Improving Large-scale Paraphrase Acquisition and Generation (Dou et al., EMNLP 2022)
PDF:
https://aclanthology.org/2022.emnlp-main.631.pdf
Dataset:
 2022.emnlp-main.631.dataset.zip