Filtering of Noisy Web-Crawled Parallel Corpus: the Japanese-Bulgarian Language Pair

Iglika Nikolova-Stoupak; Shuichiro Shimizu; Chenhui Chu; Sadao Kurohashi

Abstract

One of the main challenges within the rapidly developing field of neural machine translation is its application to low-resource languages. Recent attempts to provide large parallel corpora for rare language pairs include the generation of web-crawled corpora, which may be vast but are, unfortunately, excessively noisy. The corpus used to train machine translation models in this study is CCMatrix, provided by OPUS. First, the corpus is cleaned based on a number of heuristic rules. Then, parts of it are selected in three distinct ways: at random, based on the “margin distance” metric that is native to the CCMatrix dataset, and based on scores derived through the application of a state-of-the-art classifier model (Acarcicek et al., 2020) used in a thematic WMT shared task. The performance of the resulting models is evaluated and compared. The classifier-based model does not perform as well as its margin-based counterpart, which opens a discussion of ways for further improvement. Still, BLEU scores surpass those reported by Acarcicek et al. (2020) by over 15 points.
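
The three selection strategies named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Pair` record, its `margin` and `clf_score` fields, the toy values, and the helper names `select_random` and `select_top` are all hypothetical, and the sketch assumes that a higher score indicates a better sentence pair for both metrics. The heuristic cleaning step is not shown, since the abstract does not specify its rules.

```python
import random
from typing import Callable, List, NamedTuple


class Pair(NamedTuple):
    """A hypothetical sentence pair with two precomputed quality scores."""
    src: str          # Japanese side
    tgt: str          # Bulgarian side
    margin: float     # assumed margin-based score (higher = better)
    clf_score: float  # assumed classifier score (higher = better)


def select_random(pairs: List[Pair], k: int, seed: int = 0) -> List[Pair]:
    """Baseline: pick k pairs uniformly at random (seeded for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(pairs, k)


def select_top(pairs: List[Pair], k: int, key: Callable[[Pair], float]) -> List[Pair]:
    """Pick the k pairs with the highest value of the given score."""
    return sorted(pairs, key=key, reverse=True)[:k]


# Toy data; the scores are made up for illustration only.
pairs = [
    Pair("ja1", "bg1", margin=1.20, clf_score=0.30),
    Pair("ja2", "bg2", margin=1.05, clf_score=0.90),
    Pair("ja3", "bg3", margin=1.10, clf_score=0.60),
    Pair("ja4", "bg4", margin=0.95, clf_score=0.10),
]

random_subset = select_random(pairs, k=2)
margin_subset = select_top(pairs, k=2, key=lambda p: p.margin)
clf_subset = select_top(pairs, k=2, key=lambda p: p.clf_score)
```

Each subset would then serve as training data for a separate translation model, so that the selection criteria can be compared on equal subset sizes.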

Anthology ID: 2022.clib-1.4
Volume: Proceedings of the Fifth International Conference on Computational Linguistics in Bulgaria (CLIB 2022)
Month: September
Year: 2022
Address: Sofia, Bulgaria
Venue: CLIB
Publisher: Department of Computational Linguistics, IBL -- BAS
Pages: 39–48
URL: https://aclanthology.org/2022.clib-1.4/
Cite (ACL): Iglika Nikolova-Stoupak, Shuichiro Shimizu, Chenhui Chu, and Sadao Kurohashi. 2022. Filtering of Noisy Web-Crawled Parallel Corpus: the Japanese-Bulgarian Language Pair. In Proceedings of the Fifth International Conference on Computational Linguistics in Bulgaria (CLIB 2022), pages 39–48, Sofia, Bulgaria. Department of Computational Linguistics, IBL -- BAS.
Cite (Informal): Filtering of Noisy Web-Crawled Parallel Corpus: the Japanese-Bulgarian Language Pair (Nikolova-Stoupak et al., CLIB 2022)
PDF: https://aclanthology.org/2022.clib-1.4.pdf
