Does Corpus Quality Really Matter for Low-Resource Languages?

The vast majority of non-English corpora are derived from automatically filtered versions of CommonCrawl. While prior work has identified major issues with the quality of these datasets (Kreutzer et al., 2021), it is not clear how this impacts downstream performance. Taking representation learning in Basque as a case study, we explore tailored crawling (manually identifying and scraping websites with high-quality content) as an alternative to filtering CommonCrawl. Our new corpus, called EusCrawl, is similar in size to the Basque portion of popular multilingual corpora like CC100 and mC4, yet it has a much higher quality according to native annotators. For instance, 66% of EusCrawl documents are rated as high-quality, in contrast with fewer than 33% for both mC4 and CC100. Nevertheless, we obtain similar results on downstream NLU tasks regardless of the corpus used for pre-training. Our work suggests that NLU performance in low-resource languages is not primarily constrained by data quality, and that other factors like corpus size and domain coverage can play a more important role.


Introduction
Large-scale pre-training has resulted in a paradigm shift in NLP (Bommasani et al., 2021). While recent progress has been primarily driven by scaling up model size and compute, both data quantity and quality have been shown to play a critical role (Kaplan et al., 2020; Rae et al., 2022). Nevertheless, existing efforts on data curation have primarily focused on English, and recent work on multilingual pre-training has relied on automatically filtered versions of CommonCrawl. For instance, XLM-R was trained on CC100 (Conneau et al., 2020), mT5 was trained on mC4 (Xue et al., 2021), and XGLM was trained on CC100-XL (Lin et al., 2021), all of which were obtained by running language identification on several CommonCrawl snapshots and filtering through language-agnostic approaches. Unfortunately, Kreutzer et al. (2021) identified major issues with the quality of such multilingual datasets, ranging from language identification errors to boilerplate and non-linguistic content. However, the practical impact of these issues has not been studied, and it is unclear to what extent higher-quality data could lead to better performance in low-resource languages.
In this paper, we take representation learning in Basque as a case study, and explore tailored crawling (i.e., manually identifying and scraping websites with high-quality content) as an alternative to filtering CommonCrawl. We introduce EusCrawl, a new corpus for Basque comprising 12.5M documents from 33 websites with Creative Commons content. EusCrawl is similar in size to the Basque portion of CC100 and mC4, but it has substantially fewer issues and a higher perceived quality according to our blind audit with native annotators. However, we find that this improvement does not carry over to downstream NLU tasks, as masked language models pre-trained on any of these corpora obtain similar results on 5 benchmarks. Our results suggest that data quantity and domain coverage play a more important role, calling for methods to exploit more diverse sources of data in low-resource languages.
This paper makes the following contributions: (i) we release EusCrawl, a high-quality corpus for Basque comprising 12.5M documents and 423M tokens; (ii) we manually assess the quality of EusCrawl in comparison with mC4 and CC100, finding that it has substantially fewer issues and a higher perceived quality according to native annotators; (iii) we compare masked language models pre-trained on EusCrawl, mC4, CC100 and Wikipedia on 5 NLU tasks, finding that they all perform similarly with the exception of Wikipedia; and (iv) we obtain state-of-the-art results on several NLU benchmarks in Basque, outperforming prior work that relied on non-public corpora.

Experimental setup
We next detail the corpora compared in our experiments (§2.1), and the qualitative and downstream evaluation settings (§2.2 and §2.3).

Corpora
We compare 4 Basque corpora in our experiments: mC4, CC100, Wikipedia and EusCrawl. Table 1 summarizes their details. mC4 and CC100 are, to the best of our knowledge, the two largest public corpora for Basque. They were introduced to train mT5 (Xue et al., 2021) and XLM-R (Conneau et al., 2020), respectively, and were built by filtering CommonCrawl. Wikipedia has been a popular source for multilingual data (Pires et al., 2019; Conneau and Lample, 2019; Artetxe et al., 2020). We extract text from a Wikipedia dump using the WikiExtractor tool. EusCrawl is a new corpus we introduce. Instead of filtering CommonCrawl, we do tailored crawling on 33 websites with high-quality content in Basque, mostly in the news domain. We build ad-hoc scrapers to extract text from these websites, resulting in higher coverage and cleaner text compared to general-purpose approaches. We only use content with a Creative Commons license. Table 2 summarizes all the sources we use.
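The paper does not include code for its ad-hoc scrapers, but the idea of site-specific extraction can be illustrated with a minimal, stdlib-only sketch. The `<article>` tag choice and all names here are hypothetical; a real scraper would be tuned to each website's markup:

```python
from html.parser import HTMLParser


class ArticleTextExtractor(HTMLParser):
    """Keep only text inside <article> tags, skipping <script>/<style>.

    This mimics (very roughly) what a tailored scraper does: extract the
    main body of a page while discarding navigation, ads and boilerplate.
    """

    def __init__(self):
        super().__init__()
        self.in_article = 0   # nesting depth of <article> tags
        self.skip = 0         # nesting depth of <script>/<style> tags
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.in_article += 1
        elif tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag == "article" and self.in_article:
            self.in_article -= 1
        elif tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if self.in_article and not self.skip:
            text = data.strip()
            if text:
                self.chunks.append(text)


def extract_article_text(html):
    parser = ArticleTextExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)
```

A generic CommonCrawl-style extractor would instead apply one heuristic to all pages, which is precisely where boilerplate and noise tend to leak in.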

Qualitative evaluation
We manually audit the quality of EusCrawl in comparison with mC4 and CC100 by randomly sampling 100 documents from each corpus (a total of 300 documents), and asking native annotators to assess their quality. We ensure that the evaluation is blind by showing the documents in a random order and not revealing which corpus they were sampled from. For each document, we ask the annotators to assess whether it has any problem in each of the following categories: langID (the document is not in Basque), language variety (the document is not written in standard and correct Basque), coherence (the document has gaps and/or some portions are not connected), noise (the document is not clean) and content (the document seems to have been generated automatically and/or lacks substantive content). In addition, we ask annotators to classify each document according to its perceived quality as high-quality (the document does not have quality issues and the annotator thinks that it would be good to have it in the corpus), medium-quality (the document has some minor issues and the annotator is unsure if it would be good to have it in the corpus), or low-quality (the document has major issues and the annotator thinks that it would be better not to have it in the corpus). Refer to Appendix A for the complete instructions given to annotators.
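The blinding protocol above (sample 100 documents per corpus, shuffle, hide the source) can be sketched in a few lines. The function and variable names are our own illustration, not the paper's tooling:

```python
import random


def build_blind_audit(corpora, n_per_corpus=100, seed=0):
    """Prepare a blind annotation batch.

    corpora: dict mapping corpus name -> list of documents.
    Returns (tasks, key): `tasks` is the shuffled list of documents shown
    to annotators; `key` maps task index -> source corpus and is withheld
    until annotation is complete.
    """
    rng = random.Random(seed)
    sampled = []
    for name, docs in corpora.items():
        for doc in rng.sample(docs, n_per_corpus):
            sampled.append((name, doc))
    rng.shuffle(sampled)
    key = {i: name for i, (name, _) in enumerate(sampled)}
    tasks = [doc for _, doc in sampled]
    return tasks, key
```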
We pre-train each model for 125k steps with a batch size of 2048 and a sequence length of 512, using the same hyperparameters as Liu et al. (2019). We train RoBERTa-base models for our main comparison using a learning rate of 7e-4, and further train a RoBERTa-large model on EusCrawl with a learning rate of 4e-4 to understand the effect of scaling. In all cases, we use the final checkpoint without early stopping. We use SentencePiece (Kudo and Richardson, 2018) for tokenization, with a 50k vocabulary learned separately on each corpus.
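As a back-of-the-envelope check, the pre-training budget implied by these hyperparameters is easy to work out from the numbers above (the ~310-epoch figure assumes the 423M-token size reported for EusCrawl):

```python
steps = 125_000     # pre-training steps
batch_size = 2048   # sequences per batch
seq_len = 512       # tokens per sequence

tokens_seen = steps * batch_size * seq_len  # total tokens processed
print(f"{tokens_seen:,}")  # 131,072,000,000, i.e. ~131B tokens

# Relative to EusCrawl's 423M tokens, this is roughly
# 131_072M / 423M ~= 310 passes over the corpus.
epochs_over_euscrawl = tokens_seen / 423_000_000
```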
For fine-tuning, we use the same hyperparameters as Agerri et al. (2020). For topic classification, sentiment classification and stance detection, we use a batch size of 16, a learning rate of 2e-5 with linear decay and a warmup of 6%, and train the model for 10 epochs. For NER and QA, we use a batch size of 32, a constant learning rate of 5e-5, and train for 4 epochs. We did not perform any hyperparameter tuning or model selection, and report results on the test set. The development sets, when available, were not used.
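The "linear decay with 6% warmup" schedule used for fine-tuning can be written down explicitly. This is a generic sketch of that common schedule, not code from the paper:

```python
def lr_at_step(step, total_steps, peak_lr=2e-5, warmup_frac=0.06):
    """Linear warmup to peak_lr over the first warmup_frac of training,
    then linear decay to 0 at total_steps."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # warmup phase: ramp up proportionally to the step count
        return peak_lr * step / max(1, warmup_steps)
    # decay phase: interpolate linearly down to 0
    return peak_lr * (total_steps - step) / max(1, total_steps - warmup_steps)
```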

Qualitative evaluation
As shown in Figure 1, EusCrawl has the best quality by a large margin along all the axes that we consider. mC4 has a slightly higher perceived quality and fewer content-related issues than CC100, but more problematic documents in the other categories.
More concretely, we find that both mC4 and CC100 have a high proportion of documents with coherence, noise and content-related issues. In addition, mC4 has a significant number of langID and language variety problems. In contrast, EusCrawl has minimal issues in all categories but content, where it still does substantially better than mC4 and CC100. Taking a closer look, we find that most of these content-related issues in EusCrawl correspond to short, template-based Wikipedia articles (e.g., Placosoma is a genus of lizards in the family Gymnophthalmidae. They live in Brazil.), which should be easy to filter in future iterations. Finally, we find that the overall quality of EusCrawl documents is also much better according to native annotators, with approximately two thirds of the documents being annotated as high-quality, compared to less than one third for both mC4 and CC100. All in all, our qualitative evaluation provides further evidence that multilingual corpora derived from CommonCrawl have major quality issues, and shows that tailored crawling can be an effective alternative to obtain high-quality data.

Downstream tasks
We report our downstream results in Table 3.
In contrast with the qualitative evaluation, we find that there is no clear winner among mC4, CC100 and EusCrawl. In fact, when looking at RoBERTa-base results, we find that mC4 does the best on sentiment classification, CC100 does the best on stance detection and QA, and EusCrawl does the best on NER. Wikipedia lags behind them all by a large margin. It is worth noting that the variance is high in certain tasks, which we attribute to the small size of the test sets and their unbalanced nature, but the general trends are consistent.
These results suggest that corpus quality issues in low-resource languages do not have a major impact on NLU performance. Instead, we find evidence that the size and domain of the training corpus matter more. This would explain why Wikipedia obtains the worst results, as it is substantially smaller than the other corpora and restricted to a narrow domain. Similarly, this is also consistent with EusCrawl performing worse than mC4 and CC100 on sentiment analysis and stance detection, as the domain of these benchmarks (tweets) is different from the domain of EusCrawl (primarily news, see Table 2), while CommonCrawl-derived corpora are presumably more diverse.
Finally, we find that scaling to RoBERTa-large brings consistent improvements in all tasks. Thanks to this, we are able to outperform the best published results on all 5 benchmarks. Note that we achieve this pre-training exclusively on Creative Commons data that we release publicly, while prior work relied on private datasets.

Conclusions
Taking Basque as a case study, our work gives further evidence that CommonCrawl-derived corpora have major quality issues in low-resource languages. At the same time, we show that ad-hoc crawling of websites with high-quality content can be an effective alternative to collect data in such languages. Our resulting corpus, EusCrawl, has a higher quality than mC4 and CC100 according to our manual data audit, while being similar in size. Nevertheless, this improvement in quality does not carry over to downstream performance on NLU tasks, where we find evidence that data quantity and domain coverage are more important factors.
Our work leaves important lessons for future efforts on low-resource languages. First of all, we find that, even if CommonCrawl-derived multilingual corpora do have major quality issues as raised by prior work (Kreutzer et al., 2021), these issues do not have a significant impact on NLU tasks. This suggests that investing in bigger and more diverse datasets might be more fruitful than addressing such quality issues in low-resource settings. Given that the amount of written text in such languages is ultimately limited, we believe that developing effective cross-lingual transfer methods to exploit multilingual data is a promising future direction.
Having said that, it should be noted that our study is limited to NLU tasks in a single language. It is possible that data quality plays a more important role in generation tasks, which we leave for future work to study. In addition, we think that it would be valuable to conduct similar studies in other languages to corroborate our findings.
Finally, we note that prior work on Basque NLP has often relied on private resources (Agerri et al., 2020). Our work sets a new state-of-the-art on a diverse set of NLU benchmarks, and it does so using public data alone. By releasing our corpus, we hope to facilitate future work in Basque NLP, and encourage open and reproducible science using public resources.

Limitations
Our evaluation focuses on NLU tasks, and it is possible that data quality plays a different role in generation tasks. We note, however, that generation quality is harder to evaluate through automatic metrics, which is why we decided to focus on NLU tasks. Moreover, the corpora that we compare differ in various aspects other than data quality (e.g., the domain), and it is hard to isolate the effect of quality from the rest. In any case, we believe that our main claim still holds, in that data quality has a minor impact relative to such other factors. Finally, our work builds on EusCrawl, a new high-quality corpus that we introduce for Basque, and our analysis is thus limited to this language. It would be interesting to collect high-quality corpora for other low-resource languages, and conduct a similar comparison to corroborate that our findings also apply more broadly.

A Annotation instructions
Table 4 reports the complete instructions used for the qualitative evaluation as given to the annotators.

B Downstream evaluation
We next provide additional details on the datasets used for downstream evaluation:
• Topic classification: The Basque Headlines Topic Classification (BHTC) dataset (Agerri et al., 2020) contains 12k headlines from the Argia news magazine classified into 12 thematic categories. We use the standard splits containing 8662 examples for training, 1861 for development and 1860 for testing.

Figure 1: Data audit results. EusCrawl has a much higher quality than mC4 and CC100. See §2.2 for more details.

Table 3: Downstream results. We report average F1 and standard deviation across 5 runs (micro F1 in all tasks except stance detection, where we report macro F1 of the FAVOR and AGAINST classes following common practice). † Best result among systems that rely exclusively on textual data.
• Stance detection: a dataset of tweets labeled as expressing an AGAINST, FAVOR or NEUTRAL stance with respect to vaccines. It contains 1070 tweets for training and 313 for testing.
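The stance detection metric reported in Table 3 (macro F1 over the FAVOR and AGAINST classes only, ignoring NEUTRAL) can be made concrete with a small pure-Python sketch; the function names are ours:

```python
def f1_per_class(y_true, y_pred, cls):
    """F1 score for a single class, treating it as the positive label."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def stance_macro_f1(y_true, y_pred):
    """Macro F1 over FAVOR and AGAINST only; NEUTRAL does not get its own
    F1 term, but NEUTRAL errors still hurt through precision/recall."""
    return (f1_per_class(y_true, y_pred, "FAVOR")
            + f1_per_class(y_true, y_pred, "AGAINST")) / 2
```

Note that excluding NEUTRAL from the average is what makes this convention stricter on the two opinionated classes, which are the ones the benchmark cares about.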