BiMax: Bidirectional MaxSim Score for Document-Level Alignment

Xiaotian Wang, Takehito Utsuro, Masaaki Nagata


Abstract
Document alignment is a necessary step in hierarchical mining: it aligns documents across source and target languages within the same web domain. Several high-precision methods based on sentence embeddings have been developed, such as TK-PERT and Optimal Transport (OT). However, given the massive scale of web mining data, both accuracy and speed must be considered. In this paper, we propose a cross-lingual Bidirectional MaxSim score (BiMax) for computing doc-to-doc similarity, which improves efficiency compared to the OT method. On the WMT16 bilingual document alignment task, BiMax attains accuracy comparable to OT with an approximately 100-fold speedup. We also conduct a comprehensive analysis of the performance of current state-of-the-art multilingual sentence embedding models.
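The abstract does not give the scoring formula, so the following is only an illustrative sketch of what a bidirectional MaxSim doc-to-doc score could look like. It assumes each document is already represented as a matrix of sentence embeddings, that MaxSim takes the maximum cosine similarity per sentence and averages over sentences, and that the two directions are combined by a simple mean. The function names `maxsim` and `bimax` and these aggregation choices are assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def _normalize(x: np.ndarray) -> np.ndarray:
    """L2-normalize each row so that dot products equal cosine similarities."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def maxsim(a: np.ndarray, b: np.ndarray) -> float:
    """One direction (assumed form): for every sentence embedding in `a`,
    take its best cosine match in `b`, then average over the sentences of `a`."""
    sim = _normalize(a) @ _normalize(b).T  # shape: (len(a), len(b))
    return float(sim.max(axis=1).mean())


def bimax(src: np.ndarray, tgt: np.ndarray) -> float:
    """Bidirectional score (assumed form): mean of both MaxSim directions."""
    return 0.5 * (maxsim(src, tgt) + maxsim(tgt, src))


# Toy documents: rows are (hypothetical) sentence embeddings.
src_doc = np.array([[1.0, 0.0], [0.0, 1.0]])
tgt_doc = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
score = bimax(src_doc, tgt_doc)
```

Under these assumptions, one score needs a single similarity matrix and two max reductions, with no iterative solver as in optimal transport, which is consistent with the efficiency motivation stated in the abstract.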
Anthology ID:
2025.findings-emnlp.704
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
13095–13116
URL:
https://aclanthology.org/2025.findings-emnlp.704/
Cite (ACL):
Xiaotian Wang, Takehito Utsuro, and Masaaki Nagata. 2025. BiMax: Bidirectional MaxSim Score for Document-Level Alignment. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 13095–13116, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
BiMax: Bidirectional MaxSim Score for Document-Level Alignment (Wang et al., Findings 2025)
PDF:
https://aclanthology.org/2025.findings-emnlp.704.pdf
Checklist:
 2025.findings-emnlp.704.checklist.pdf