Is Character Trigram Overlapping Ratio Still the Best Similarity Measure for Aligning Sentences in a Paraphrased Corpus?

Aleksandra Smolka; Hsin-Min Wang; Jason S. Chang; Keh-Yih Su

Abstract

Sentence alignment is an essential step in studying the mapping between different linguistic expressions, and the character trigram overlapping ratio was reported to be the most effective similarity measure for aligning sentences in a text simplification dataset. However, the appropriateness of a similarity measure depends on the characteristics of the corpus to be aligned. This paper studies whether the character trigram overlapping ratio is still a suitable similarity measure for aligning sentences in a paragraph paraphrasing corpus. We compare several embedding-based and non-embedding model-agnostic similarity measures, including some that have not been studied previously. The evaluation is conducted on parallel paragraphs sampled from Webis-CPC-11, a paragraph paraphrasing dataset. Our results show that modern BERT-based measures such as Sentence-BERT or BERTScore can lead to significant improvement on this task.
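To make the baseline measure concrete, here is a minimal sketch of a character trigram overlap score used for sentence alignment. The abstract does not give the exact formula, so this sketch assumes a Jaccard-style ratio over sets of character trigrams; the function names, the lowercasing and whitespace normalization, and the greedy best-match alignment are illustrative assumptions, not the authors' implementation.

```python
def char_trigrams(text: str) -> set[str]:
    """Return the set of character trigrams of a normalized string.

    Assumption: text is lowercased and runs of whitespace are collapsed.
    """
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + 3] for i in range(len(normalized) - 2)}


def trigram_overlap_ratio(a: str, b: str) -> float:
    """Assumed Jaccard-style ratio: shared trigrams over all distinct trigrams."""
    ta, tb = char_trigrams(a), char_trigrams(b)
    if not ta or not tb:
        return 0.0  # strings shorter than three characters have no trigrams
    return len(ta & tb) / len(ta | tb)


def align(source: list[str], target: list[str]) -> list[tuple[int, int, float]]:
    """Greedy alignment: pair each source sentence with its best-scoring target."""
    if not target:
        return []
    pairs = []
    for i, s in enumerate(source):
        scores = [trigram_overlap_ratio(s, t) for t in target]
        j = max(range(len(target)), key=scores.__getitem__)
        pairs.append((i, j, scores[j]))
    return pairs
```

Calling `align(original_sentences, paraphrased_sentences)` returns one `(source_index, target_index, score)` triple per source sentence. An embedding-based measure such as Sentence-BERT cosine similarity could be swapped in for `trigram_overlap_ratio` without changing the alignment loop.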

Anthology ID: 2022.rocling-1.7
Volume: Proceedings of the 34th Conference on Computational Linguistics and Speech Processing (ROCLING 2022)
Month: November
Year: 2022
Address: Taipei, Taiwan
Editors: Yung-Chun Chang, Yi-Chin Huang
Venue: ROCLING
Publisher: The Association for Computational Linguistics and Chinese Language Processing (ACLCLP)
Pages: 49–60
URL: https://aclanthology.org/2022.rocling-1.7/
Cite (ACL): Aleksandra Smolka, Hsin-Min Wang, Jason S. Chang, and Keh-Yih Su. 2022. Is Character Trigram Overlapping Ratio Still the Best Similarity Measure for Aligning Sentences in a Paraphrased Corpus?. In Proceedings of the 34th Conference on Computational Linguistics and Speech Processing (ROCLING 2022), pages 49–60, Taipei, Taiwan. The Association for Computational Linguistics and Chinese Language Processing (ACLCLP).
Cite (Informal): Is Character Trigram Overlapping Ratio Still the Best Similarity Measure for Aligning Sentences in a Paraphrased Corpus? (Smolka et al., ROCLING 2022)
PDF: https://aclanthology.org/2022.rocling-1.7.pdf
