Noisy Parallel Data Alignment

Ruoyu Xie, Antonios Anastasopoulos


Abstract
An ongoing challenge in natural language processing is that major advances tend to disproportionately favor resource-rich languages, leaving a significant number of under-resourced languages behind. Due to the lack of resources required to train and evaluate models, most modern language technologies are either nonexistent or unreliable for processing endangered, local, and non-standardized languages. Optical character recognition (OCR) is often used to convert endangered-language documents into machine-readable data. However, such OCR output is typically noisy, and most word alignment models are not built to work under such noisy conditions. In this work, we study existing word-level alignment models under noisy settings and aim to make them more robust to noisy data. Our noise simulation and structural biasing method, tested on multiple language pairs, reduces the alignment error rate of a state-of-the-art neural alignment model by up to 59.6%.
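
The abstract reports reductions in alignment error rate (AER). For reference, the sketch below computes the standard AER of Och and Ney (2003) from sure (S), possible (P), and predicted (A) alignment links. It is a minimal Python illustration of the metric only, not code or a method from the paper; the function and variable names are our own.

def alignment_error_rate(predicted, sure, possible):
    """AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|), where S ⊆ P."""
    predicted, sure, possible = set(predicted), set(sure), set(possible)
    # Count predicted links that match sure and possible gold links.
    hits_sure = len(predicted & sure)
    hits_possible = len(predicted & possible)
    return 1.0 - (hits_sure + hits_possible) / (len(predicted) + len(sure))

# Example: alignment links are (source index, target index) pairs.
sure = {(0, 0), (1, 2)}
possible = {(0, 0), (1, 2), (2, 2)}
predicted = {(0, 0), (1, 2), (2, 1)}
print(alignment_error_rate(predicted, sure, possible))  # 0.2

Lower AER is better; a noisier input (e.g., OCR output) typically yields more spurious or missing links and hence a higher AER.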
Anthology ID:
2023.findings-eacl.111
Volume:
Findings of the Association for Computational Linguistics: EACL 2023
Month:
May
Year:
2023
Address:
Dubrovnik, Croatia
Editors:
Andreas Vlachos, Isabelle Augenstein
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
1501–1513
URL:
https://aclanthology.org/2023.findings-eacl.111
DOI:
10.18653/v1/2023.findings-eacl.111
Cite (ACL):
Ruoyu Xie and Antonios Anastasopoulos. 2023. Noisy Parallel Data Alignment. In Findings of the Association for Computational Linguistics: EACL 2023, pages 1501–1513, Dubrovnik, Croatia. Association for Computational Linguistics.
Cite (Informal):
Noisy Parallel Data Alignment (Xie & Anastasopoulos, Findings 2023)
PDF:
https://aclanthology.org/2023.findings-eacl.111.pdf
Video:
https://aclanthology.org/2023.findings-eacl.111.mp4