ParaZh-22M: A Large-Scale Chinese Parabank via Machine Translation

Wenjie Hao; Hongfei Xu (许鸿飞); Deyi Xiong; Hongying Zan (昝红英); Lingling Mu (穆玲玲)

Abstract

Paraphrasing, i.e., restating the same meaning in different ways, is an important data augmentation approach for natural language processing (NLP). Zhang et al. (2019b) propose extracting sentence-level paraphrases from multiple Chinese translations of the same source texts, and construct the PKU Paraphrase Bank of 0.5M sentence pairs. However, despite being the largest Chinese parabank to date, the PKU Paraphrase Bank is limited in size by the availability of one-to-many sentence translation data, and cannot adequately support the training of large Chinese paraphrasers. In this paper, we remove the dependence on one-to-many sentence translation data and construct ParaZh-22M, a larger Chinese parabank of 22M sentence pairs, based on one-to-one bilingual sentence translation data and machine translation (MT). In our data augmentation experiments, we show that paraphrasing based on ParaZh-22M brings consistent and significant improvements over several strong baselines on a wide range of Chinese NLP tasks, including a number of Chinese natural language understanding benchmarks (CLUE) and low-resource machine translation.
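
As a rough illustration of how a parabank can be derived from one-to-one bilingual data and MT, the sketch below assumes that the non-Chinese side of each sentence pair is machine-translated into Chinese and then paired with the original Chinese reference. The abstract does not spell out this procedure, so the translation direction, the function name `build_paraphrase_pairs`, the `translate_to_zh` callable, and the filtering rule are all hypothetical choices made for the example, not the authors' pipeline.

```python
from typing import Callable, Iterable, Iterator, Tuple


def build_paraphrase_pairs(
    bitext: Iterable[Tuple[str, str]],
    translate_to_zh: Callable[[str], str],
) -> Iterator[Tuple[str, str]]:
    """Yield (reference_zh, machine_translated_zh) paraphrase candidates.

    `bitext` holds (source_sentence, reference_zh) pairs. `translate_to_zh`
    is a placeholder for any MT system (an assumption, not a real API).
    """
    for source_sentence, reference_zh in bitext:
        mt_zh = translate_to_zh(source_sentence).strip()
        reference_zh = reference_zh.strip()
        # Assumed filter: skip empty outputs and exact copies of the
        # reference, since neither gives a usable paraphrase pair.
        if not mt_zh or mt_zh == reference_zh:
            continue
        yield reference_zh, mt_zh


# Toy lookup table standing in for a trained MT model (hypothetical).
toy_mt = {
    "I like reading.": "我喜欢读书。",
    "Thank you.": "谢谢。",
}

bitext = [
    ("I like reading.", "我爱看书。"),
    ("Thank you.", "谢谢。"),
]

pairs = list(build_paraphrase_pairs(bitext, lambda s: toy_mt[s]))
```

Here the second pair is dropped because the toy translation is identical to the reference; a real pipeline would likely need stronger quality and diversity filtering than this.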

Anthology ID: 2022.coling-1.341
Volume: Proceedings of the 29th International Conference on Computational Linguistics
Month: October
Year: 2022
Address: Gyeongju, Republic of Korea
Editors: Nicoletta Calzolari, Chu-Ren Huang, Hansaem Kim, James Pustejovsky, Leo Wanner, Key-Sun Choi, Pum-Mo Ryu, Hsin-Hsi Chen, Lucia Donatelli, Heng Ji, Sadao Kurohashi, Patrizia Paggio, Nianwen Xue, Seokhwan Kim, Younggyun Hahm, Zhong He, Tony Kyungil Lee, Enrico Santus, Francis Bond, Seung-Hoon Na
Venue: COLING
Publisher: International Committee on Computational Linguistics
Pages: 3885–3897
URL: https://aclanthology.org/2022.coling-1.341/
Cite (ACL): Wenjie Hao, Hongfei Xu, Deyi Xiong, Hongying Zan, and Lingling Mu. 2022. ParaZh-22M: A Large-Scale Chinese Parabank via Machine Translation. In Proceedings of the 29th International Conference on Computational Linguistics, pages 3885–3897, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
Cite (Informal): ParaZh-22M: A Large-Scale Chinese Parabank via Machine Translation (Hao et al., COLING 2022)
PDF: https://aclanthology.org/2022.coling-1.341.pdf