Investigating Web Corpus Filtering Methods for Language Model Development in Japanese

Rintaro Enomoto, Arseny Tolmachev, Takuro Niitsuma, Shuhei Kurita, Daisuke Kawahara


Abstract
The development of large language models (LLMs) is becoming increasingly significant, and there is a demand for high-quality, large-scale corpora for their pretraining.The quality of a web corpus is especially essential to improve the performance of LLMs because it accounts for a large proportion of the whole corpus. However, filtering methods for Web corpora have yet to be established.In this paper, we present empirical studies to reveal which filtering methods are indeed effective and analyze why they are.We build classifiers and language models in Japanese that can process large amounts of corpora rapidly enough for pretraining LLMs in limited computational resources. By evaluating these filtering methods based on a Web corpus quality evaluation benchmark, we reveal that the most accurate method is the N-gram language model. Indeed, we empirically present that strong filtering methods can rather lead to lesser performance in downstream tasks.We also report that the proportion of some specific topics in the processed documents decreases significantly during the filtering process.
Anthology ID:
2024.naacl-srw.18
Volume:
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Yang (Trista) Cao, Isabel Papadimitriou, Anaelia Ovalle
Venue:
NAACL
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
154–160
Language:
URL:
https://aclanthology.org/2024.naacl-srw.18
DOI:
Bibkey:
Cite (ACL):
Rintaro Enomoto, Arseny Tolmachev, Takuro Niitsuma, Shuhei Kurita, and Daisuke Kawahara. 2024. Investigating Web Corpus Filtering Methods for Language Model Development in Japanese. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop), pages 154–160, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
Investigating Web Corpus Filtering Methods for Language Model Development in Japanese (Enomoto et al., NAACL 2024)
Copy Citation:
PDF:
https://aclanthology.org/2024.naacl-srw.18.pdf