Rintaro Enomoto
2024
Investigating Web Corpus Filtering Methods for Language Model Development in Japanese
Rintaro Enomoto
|
Arseny Tolmachev
|
Takuro Niitsuma
|
Shuhei Kurita
|
Daisuke Kawahara
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)
The development of large language models (LLMs) is becoming increasingly significant, and there is a demand for high-quality, large-scale corpora for their pretraining.The quality of a web corpus is especially essential to improve the performance of LLMs because it accounts for a large proportion of the whole corpus. However, filtering methods for Web corpora have yet to be established.In this paper, we present empirical studies to reveal which filtering methods are indeed effective and analyze why they are.We build classifiers and language models in Japanese that can process large amounts of corpora rapidly enough for pretraining LLMs in limited computational resources. By evaluating these filtering methods based on a Web corpus quality evaluation benchmark, we reveal that the most accurate method is the N-gram language model. Indeed, we empirically present that strong filtering methods can rather lead to lesser performance in downstream tasks.We also report that the proportion of some specific topics in the processed documents decreases significantly during the filtering process.