PARADE: A New Dataset for Paraphrase Identification Requiring Computer Science Domain Knowledge

Yun He, Zhuoer Wang, Yin Zhang, Ruihong Huang, James Caverlee


Abstract
We present a new benchmark dataset called PARADE for paraphrase identification that requires specialized domain knowledge. PARADE contains paraphrases that overlap very little at the lexical and syntactic level but are semantically equivalent based on computer science domain knowledge, as well as non-paraphrases that overlap greatly at the lexical and syntactic level but are not semantically equivalent based on this domain knowledge. Experiments show that both state-of-the-art neural models and non-expert human annotators have poor performance on PARADE. For example, BERT after fine-tuning achieves an F1 score of 0.709, which is much lower than its performance on other paraphrase identification datasets. PARADE can serve as a resource for researchers interested in testing models that incorporate domain knowledge. We make our data and code freely available.
Anthology ID:
2020.emnlp-main.611
Volume:
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Month:
November
Year:
2020
Address:
Online
Editors:
Bonnie Webber, Trevor Cohn, Yulan He, Yang Liu
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
7572–7582
URL:
https://aclanthology.org/2020.emnlp-main.611
DOI:
10.18653/v1/2020.emnlp-main.611
Cite (ACL):
Yun He, Zhuoer Wang, Yin Zhang, Ruihong Huang, and James Caverlee. 2020. PARADE: A New Dataset for Paraphrase Identification Requiring Computer Science Domain Knowledge. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7572–7582, Online. Association for Computational Linguistics.
Cite (Informal):
PARADE: A New Dataset for Paraphrase Identification Requiring Computer Science Domain Knowledge (He et al., EMNLP 2020)
PDF:
https://aclanthology.org/2020.emnlp-main.611.pdf
Video:
https://slideslive.com/38939139
Code:
heyunh2015/PARADE_dataset
Data:
PARADE, GLUE, MRPC, PAWS, PIT