A Large-Scale Benchmark for Vietnamese Sentence Paraphrases

Sang Quang Nguyen; Kiet Van Nguyen

doi:10.18653/v1/2025.findings-naacl.59

Abstract

This paper presents ViSP, a high-quality Vietnamese dataset for sentence paraphrasing, consisting of 1.2M original–paraphrase pairs collected from various domains. The dataset was constructed using a hybrid approach that combines automatic paraphrase generation with manual evaluation to ensure quality. We conducted experiments with methods such as back-translation and EDA, baseline models such as BART and T5, and large language models (LLMs), including GPT-4o, Gemini-1.5, Aya, Qwen-2.5, and Meta-Llama-3.1 variants. To the best of our knowledge, this is the first large-scale study on Vietnamese paraphrasing. We hope that our dataset and findings will serve as a valuable foundation for future research and applications in Vietnamese paraphrase tasks. The dataset is available for research purposes at https://github.com/ngwgsang/ViSP.

Anthology ID: 2025.findings-naacl.59
Volume: Findings of the Association for Computational Linguistics: NAACL 2025
Month: April
Year: 2025
Address: Albuquerque, New Mexico
Editors: Luis Chiruzzo, Alan Ritter, Lu Wang
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 1045–1060
URL: https://aclanthology.org/2025.findings-naacl.59/
DOI: 10.18653/v1/2025.findings-naacl.59
Cite (ACL): Sang Quang Nguyen and Kiet Van Nguyen. 2025. A Large-Scale Benchmark for Vietnamese Sentence Paraphrases. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 1045–1060, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal): A Large-Scale Benchmark for Vietnamese Sentence Paraphrases (Nguyen & Nguyen, Findings 2025)
PDF: https://aclanthology.org/2025.findings-naacl.59.pdf
