SentAlign: Accurate and Scalable Sentence Alignment

Steinthor Steingrimsson, Hrafn Loftsson, Andy Way


Abstract
We present SentAlign, an accurate sentence alignment tool designed to handle very large parallel document pairs. Given user-defined parameters, the alignment algorithm evaluates all possible alignment paths in fairly large documents of thousands of sentences and uses a divide-and-conquer approach to align documents containing tens of thousands of sentences. The scoring function is based on LaBSE bilingual sentence representations. SentAlign outperforms five other sentence alignment tools when evaluated on two different evaluation sets, German-French and English-Icelandic, and on a downstream machine translation task.
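The core idea in the abstract, scoring candidate sentence pairs by embedding similarity and searching over alignment paths, can be illustrated with a small dynamic program. This is only a sketch under stated assumptions, not SentAlign's actual algorithm. The function names `cosine` and `align` and the `skip_penalty` parameter are hypothetical. Toy vectors stand in for LaBSE embeddings. Only one-to-one alignments and skipped sentences are considered, and the divide-and-conquer step for very large documents is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align(src_vecs, tgt_vecs, skip_penalty=0.3):
    """Return the best monotone alignment as (src_index, tgt_index) pairs.

    None in either position marks a sentence left unaligned.
    skip_penalty is an illustrative assumption, not a SentAlign parameter.
    """
    n, m = len(src_vecs), len(tgt_vecs)
    neg_inf = float("-inf")
    best = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0

    for i in range(n + 1):
        for j in range(m + 1):
            candidates = []
            if i > 0 and j > 0:  # align source i-1 with target j-1
                sim = cosine(src_vecs[i - 1], tgt_vecs[j - 1])
                candidates.append((best[i - 1][j - 1] + sim, (i - 1, j - 1)))
            if i > 0:  # leave source i-1 unaligned
                candidates.append((best[i - 1][j] - skip_penalty, (i - 1, j)))
            if j > 0:  # leave target j-1 unaligned
                candidates.append((best[i][j - 1] - skip_penalty, (i, j - 1)))
            for score, prev in candidates:
                if score > best[i][j]:
                    best[i][j] = score
                    back[i][j] = prev

    # Walk the back-pointers from the end to recover the path.
    path = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        if pi == i - 1 and pj == j - 1:
            path.append((i - 1, j - 1))
        elif pi == i - 1:
            path.append((i - 1, None))
        else:
            path.append((None, j - 1))
        i, j = pi, pj
    path.reverse()
    return path


# Toy example: the second source sentence has no counterpart in the target.
src = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
tgt = [[1, 0, 0], [0, 0, 1]]
print(align(src, tgt))  # [(0, 0), (1, None), (2, 1)]
```

The table has (n+1) x (m+1) cells, so the cost grows with the product of the two document lengths. That growth is one reason splitting very large documents into smaller pieces, as the abstract describes, can be attractive.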
Anthology ID:
2023.emnlp-demo.22
Volume:
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Month:
December
Year:
2023
Address:
Singapore
Editors:
Yansong Feng, Els Lefever
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
256–263
URL:
https://aclanthology.org/2023.emnlp-demo.22
DOI:
10.18653/v1/2023.emnlp-demo.22
Cite (ACL):
Steinthor Steingrimsson, Hrafn Loftsson, and Andy Way. 2023. SentAlign: Accurate and Scalable Sentence Alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 256–263, Singapore. Association for Computational Linguistics.
Cite (Informal):
SentAlign: Accurate and Scalable Sentence Alignment (Steingrimsson et al., EMNLP 2023)
PDF:
https://aclanthology.org/2023.emnlp-demo.22.pdf
Video:
https://aclanthology.org/2023.emnlp-demo.22.mp4