A Simple Yet Effective Corpus Construction Method for Chinese Sentence Compression

Yang Zhao, Hiroshi Kanayama, Issei Yoshida, Masayasu Muraoka, Akiko Aizawa


Abstract
Deletion-based sentence compression for English has made significant progress over the past few decades. However, Chinese lacks a large-scale, high-quality parallel corpus of (sentence, compression) pairs for training an efficient compression system. To remedy this shortcoming, we present a dependency-tree-based method that draws on Chinese language-specific characteristics to construct a Chinese corpus of 151k (sentence, compression) pairs. We then train both extractive and generative neural compression models on the constructed corpus. The experimental results show that our compression models generate higher-quality compressed sentences than the baselines under both automatic and human evaluation. The faithfulness evaluation further indicates that Chinese compression models trained on our corpus produce more faithful compressed sentences. In addition, we manually created a dataset of 1,000 sentences paired with ground-truth compressions for automatic evaluation, which we believe will benefit future research on Chinese sentence compression.
Anthology ID:
2022.lrec-1.742
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
6874–6883
URL:
https://aclanthology.org/2022.lrec-1.742
Cite (ACL):
Yang Zhao, Hiroshi Kanayama, Issei Yoshida, Masayasu Muraoka, and Akiko Aizawa. 2022. A Simple Yet Effective Corpus Construction Method for Chinese Sentence Compression. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 6874–6883, Marseille, France. European Language Resources Association.
Cite (Informal):
A Simple Yet Effective Corpus Construction Method for Chinese Sentence Compression (Zhao et al., LREC 2022)
PDF:
https://aclanthology.org/2022.lrec-1.742.pdf
Data:
OCNLI, Sentence Compression