RuPAWS: A Russian Adversarial Dataset for Paraphrase Identification

Nikita Martynov, Irina Krotova, Varvara Logacheva, Alexander Panchenko, Olga Kozlova, Nikita Semenov


Abstract
The paraphrase identification task can easily be challenged by changing word order, e.g. as in “Can a good person become bad?”. While this problem was tackled for English by the PAWS dataset (Zhang et al., 2019), datasets for Russian paraphrase detection lack non-paraphrase examples with high lexical overlap. We present RuPAWS, the first adversarial dataset for Russian paraphrase identification. Our dataset consists of examples from PAWS translated into Russian and manually annotated by native speakers. We compare it to ParaPhraser, the largest available dataset for Russian, and show that the best available paraphrase identifiers for Russian fail on RuPAWS. At the same time, the state-of-the-art paraphrasing model RuBERT, trained on both RuPAWS and ParaPhraser, obtains high performance on RuPAWS while maintaining its accuracy on the ParaPhraser benchmark. We also show that RuPAWS can measure the sensitivity of models to word order and syntactic structure, since simple baselines fail even when given RuPAWS training samples.
Anthology ID:
2022.lrec-1.610
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
5683–5691
URL:
https://aclanthology.org/2022.lrec-1.610
Cite (ACL):
Nikita Martynov, Irina Krotova, Varvara Logacheva, Alexander Panchenko, Olga Kozlova, and Nikita Semenov. 2022. RuPAWS: A Russian Adversarial Dataset for Paraphrase Identification. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 5683–5691, Marseille, France. European Language Resources Association.
Cite (Informal):
RuPAWS: A Russian Adversarial Dataset for Paraphrase Identification (Martynov et al., LREC 2022)
PDF:
https://aclanthology.org/2022.lrec-1.610.pdf
Code:
mts-ai/rupaws-dataset