Better Quality Pretraining Data and T5 Models for African Languages



Introduction
As language models have scaled up in size and multilingual capability in recent years, commensurate effort has followed to curate pretraining data (Raffel et al., 2020) to support this growth and improve the alignment of language models.
Earlier multilingual models such as mBERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2019) were trained on monolingual data from Wikipedia and/or other large-scale web crawls which included only a few African languages. The introduction of mC4 (Xue et al., 2021), a document-level dataset spanning 101 languages, helped alleviate this coverage gap. However, previous work (Kreutzer et al., 2022) has shown that mC4 and other existing large-scale pretraining corpora have numerous quality issues, particularly for the low-resource African languages they contain.
This tradeoff between quantity and quality is forced by the unavailability of large, quality pretraining data for African languages. Motivated by this need, we introduce a new multilingual pretraining corpus in 20 African languages. We draw from Kreutzer et al. (2022)'s audit of existing pretraining corpora to understand prevailing quality issues. For mC4, they cite a high ratio of both sentences in incorrect languages (15.98% on average) and non-linguistic content (11.40% on average). We trace these issues to the quality of data sources used in mC4 for the languages in our study and design heuristics to effectively extract clean monolingual text.
More notably, we demonstrate how large-scale web crawls and document-level datasets, such as mC4, can be enhanced through meticulous auditing of their document sources, i.e., base URLs (e.g., www.voahausa.com). Interestingly, for numerous credible sources, mC4 contains fewer documents than are actually available. We conduct our own web crawl of these sources, collecting more documents than are present in mC4 for the respective languages. We consolidate the results of our efforts (cleaning and crawling) with data from other sources, notably Wikipedia, and include four high-resource languages: Arabic, English, French and Portuguese.
To evaluate the quality of our new corpus, we pretrain a new T5-based LM on the collected dataset and benchmark its performance on multiple downstream tasks. Our model demonstrates improved effectiveness over existing pretrained LMs, further highlighting the importance of carefully curated datasets for pretraining language models in low-resource scenarios. Our model is significantly better than the baseline mT5 models across four different downstream tasks. Specifically, on cross-lingual QA evaluation, our new model achieves more than double the performance of multilingual T5.

WURA Dataset
We present WURA, a multilingual dataset comprising 16 African languages and 4 high-resource languages popularly spoken on the African continent: Arabic, English, French, and Portuguese.
The curation of WURA was carried out in a three-part process: (i) auditing and cleaning mC4, (ii) crawling indigenous websites, and (iii) combining with existing language resources.

Language Contamination
Kreutzer et al. (2022) report mC4's high ratio of non-linguistic content and sentences in incorrect languages, with African languages being of particular concern. The authors report a significant loss (up to 50%) in recall of correct in-language sentences as they increased the precision of their automatic language classification.
Our manual audit of mC4 corroborates the documented issues. We highlight three important findings: (1) the distribution of mC4 document sources has a long tail; many individual news publications yield thousands of documents in mC4. (2) Documents from news publications are more likely to be of higher quality, i.e., both in-language and grammatical, compared to documents from other web sources. (3) Some documents are from websites which translate content using online translation tools. Such documents are often a mix of in-language and noisy or non-linguistic text, and may best be filtered at sentence level. Noting all of these issues and findings, we filter at three levels:

Corpus-level. We first rank unique websites in descending order of the number of documents they contribute to the mC4 corpus for each language. Then, we select the top 20% of websites for each language and collect documents sourced from websites in this list. This preserves high-potential sources for further document-level filtering.

Document-level. At the document level, we filter out documents that do not contain at least 5 stopwords (Caswell et al., 2020).

Passage-level. After document-level filtering, we chunk the dataset into passages of roughly 512 tokens. Finally, we filter out passages that contain fewer than 4 unique words, contain repetition for more than 20% of their word length, have more than 40% numeric characters, or contain markers of possibly offensive content such as those included in the Toxicity-200 dataset (NLLB Team et al., 2022) for the relevant language.
While Kreutzer et al. (2022)'s audit of mC4 did not yield a significant amount of offensive content (0.06% of sentences they audited) and our web crawls mainly focused on verified news publications, these filters ensure that non-linguistic and offensive content is removed at the passage level.
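Under one reading of these heuristics, the passage-level filters can be sketched as follows. The function name and the exact form of the repetition check are ours (one plausible interpretation of the 20% criterion), and the toxicity-marker lookup is omitted:

```python
from collections import Counter

def passes_passage_filters(passage: str) -> bool:
    """Apply the passage-level heuristics: unique-word count,
    repetition ratio, and numeric-character ratio."""
    words = passage.split()
    if len(set(words)) < 4:
        return False  # fewer than 4 unique words
    # Repetition: the most frequent word accounts for >20% of all words.
    if Counter(words).most_common(1)[0][1] / len(words) > 0.20:
        return False
    chars = [c for c in passage if not c.isspace()]
    if chars and sum(c.isdigit() for c in chars) / len(chars) > 0.40:
        return False  # more than 40% numeric characters
    return True
```

A passage of phone numbers, for example, fails the numeric-ratio check even though it contains enough unique "words".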

mC4 is a Great Source!
Xue et al. (2021)'s inclusion of the URL each document is sourced from makes the mC4 corpus even more useful as a data source. Commonly, multiple articles are collected from the same base website, e.g., news publications. For many news publications that provide a sitemap, we find that there are fewer articles in mC4 than are actually available on the websites. Further, mC4 only covers up to August 2020, so updating the crawls up to the current day yields more data.
We initiate focused crawls for such websites, and this leads to a significant increase (>100% for Hausa and Somali) in the number of articles available per language. For all languages we consider except Chichewa, Sesotho, Xhosa and Zulu, we collect 1.39M articles (see Table 6) from credible sources found in mC4.

Table 1: MasakhaNews classification results: Evaluation is done using the weighted F1 score, and the scores presented are averaged across 3 seeds. AfriTeVa V2 surpasses mT5-base by up to 10 points. The average scores excluding languages not in the mC4 corpus are also provided as AVG SL.
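To illustrate how a sitemap can be compared against mC4 coverage, a minimal parser for the standard sitemap XML format might look like this. The function names and comparison logic are ours, not the paper's crawler:

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(sitemap_xml: str) -> set[str]:
    """Collect the <loc> entries (article URLs) from a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return {loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc")}

def missing_from_mc4(sitemap_xml: str, mc4_urls: set[str]) -> set[str]:
    """URLs a news site advertises that are absent from the mC4 crawl."""
    return sitemap_urls(sitemap_xml) - mc4_urls
```

The difference set gives crawl targets: articles the publication advertises that the existing corpus never captured.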

Combination with Existing Language Resources and Non-African Languages
Following previous works (Alabi et al., 2022; Adebara et al., 2022), we include certain non-African languages in our pretraining data. Specifically, we include over 240,000 articles newly crawled from 10 African news websites reporting in English, French and Portuguese. We also include a sample of 1.5M Wikipedia articles for English and French, as well as Wikipedia articles written in Egyptian Arabic. For the African languages, we include all Wikipedia articles. Finally, we deduplicate using the document URLs. In doing this, we prioritize news articles in our focused crawls over their existing counterparts in mC4.
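The URL-level deduplication, with focused-crawl articles taking precedence over their mC4 counterparts, can be sketched as follows (the URL-to-document mapping shape is our assumption):

```python
def merge_with_priority(focused_crawl: dict[str, str],
                        mc4: dict[str, str]) -> dict[str, str]:
    """Merge two URL -> document mappings; on a URL collision,
    the focused-crawl version of the article wins."""
    merged = dict(mc4)            # start from mC4 documents
    merged.update(focused_crawl)  # overwrite duplicates with fresh crawls
    return merged
```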
Final Dataset Statistics. Table 6 presents a statistical summary of our dataset. The combined dataset from crawling, combination with existing sources, and deduplication amounts to ∼30GB of data across all languages and ∼19GB for African languages.
3 Experimental Setup

Model
Using t5x and seqio (Roberts et al., 2022), we pretrain a T5 (Shazeer, 2020; Raffel et al., 2020) model with a subword tokenizer of vocabulary size 150,000. We pretrain for 524,288 steps on the span-corruption objective using the Adafactor optimizer. Each training batch consists of 512 examples, each with an input of 512 tokens and an output of 114 tokens. We name our new model AfriTeVa V2; it has 428M parameters.
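To illustrate the span-corruption objective (Raffel et al., 2020), the sketch below shows how selected spans are replaced by sentinel tokens in the input while the targets reproduce the dropped text. In practice span positions are sampled randomly; this simplified version takes them as given:

```python
def apply_span_corruption(tokens: list[str],
                          spans: list[tuple[int, int]]) -> tuple[list[str], list[str]]:
    """Replace each [start, end) span with a sentinel token; the target
    sequence lists each sentinel followed by the tokens it replaced.
    `spans` must be disjoint and sorted."""
    inputs, targets = [], []
    prev = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[prev:start])
        inputs.append(sentinel)
        targets.append(sentinel)
        targets.extend(tokens[start:end])
        prev = end
    inputs.extend(tokens[prev:])
    return inputs, targets
```

The model is trained to emit the target sequence given the corrupted input, so it must reconstruct the missing spans from context.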

Cross-lingual Question Answering
We evaluate our models on the test set of AfriQA (Ogundepo et al., 2023), a cross-lingual question answering dataset with questions in 10 African languages and gold passages in English or French. We evaluate in a zero-shot generative cross-lingual QA setting using in-language queries and the provided gold passages in English.

Machine Translation
We evaluate using MAFAND-MT (Adelani et al., 2022), a machine translation benchmark in the news domain. MAFAND-MT contains a few thousand parallel training sentences (2,500-30,000) for 16 African languages, making it ideal for evaluating the effective adaptation of pretrained LMs to new languages and domains.

Summarization
For summarization, we use XL-Sum (Hasan et al., 2021), an abstractive summarization dataset which covers 44 languages, including 9 African languages.The authors establish strong baselines on both low and high-resource languages in the dataset through multilingual finetuning of mT5.

Text Classification
We use MasakhaNews, the news topic classification dataset recently introduced by Adelani et al. (2023) for 16 African languages. The authors establish multiple baselines on the dataset using both classical machine learning models and finetuning or prompting language models.

Baseline Models
We compare AfriTeVa V2 against mT5, AfriTeVa, AfriMT5 and AfriByT5. AfriMT5 and AfriByT5 are adapted from mT5 and ByT5 models using continual pretraining. Apart from AfriTeVa, AfriTeVa V2 has ∼26% fewer parameters than the other baseline models.

Table 2: MAFAND-MT results: Evaluation is done using the BLEU score; we obtain significantly better performance on average across all languages in both the en-xx and xx-en directions, except for ibo and pcm.
4 Results and Discussion

Downstream Performance
In this section, we compare AfriTeVa V2 to baseline models on selected tasks. For each downstream task, we evaluate under the same conditions. We perform per-language finetuning for machine translation and text classification, and multilingual finetuning over 35K steps for summarization.

Cross-lingual Question Answering:
AfriTeVa V2 achieves very impressive results on the cross-lingual question-answering task, especially for languages in our pretraining data. We finetune on the train set of SQuAD 2.0 (Rajpurkar et al., 2016) and evaluate the model's performance on the test set of AfriQA. We compare performance on generative gold passage answer prediction, with in-language queries and English passages. Table 4 shows that AfriTeVa V2 achieves much better F1 scores and exact match accuracies (∼2×) across 6 out of 7 languages compared to using mT5-base as the backbone model.
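For reference, SQuAD-style exact match and token-level F1, the metrics conventionally reported for this task, can be computed as below. This is a standard sketch rather than the paper's exact evaluation script; answer-normalization details (punctuation, articles) vary between implementations:

```python
from collections import Counter

def normalize(text: str) -> list[str]:
    """Lowercase and whitespace-tokenize an answer string."""
    return text.lower().split()

def exact_match(prediction: str, gold: str) -> bool:
    return normalize(prediction) == normalize(gold)

def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall between answers."""
    pred, ref = normalize(prediction), normalize(gold)
    common = Counter(pred) & Counter(ref)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```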

Machine Translation
We observe higher BLEU scores when translating from African languages into English than in the reverse direction. According to Table 2, we achieve a better score on average, topping the mT5 and AfriMT5 base models by ∼1-3 points. While both ByT5-style models show greater effectiveness than the mT5 models, AfriTeVa V2 consistently improves over both results for all languages except ibo and pcm, an English-based creole language.

Summarization
We perform multilingual training for 35,000 steps and sample each batch from a single language. Table 3 shows we match the performance of mT5 on orm and pcm and gain improvements over baseline ROUGE scores for the other languages we consider, with yor benefiting the most.

Text Classification
Our results for the news classification task are presented in Table 1. We finetune AfriTeVa V2 on MasakhaNews for each language, framing it as a text-to-text task by predicting the class of each article in the decoding sequence, and report results averaged over 3 random seeds. On average, AfriTeVa V2 yields better F1 scores across all languages and has the best F1 score on 10 out of 16 languages.

Results for Nigerian Pidgin
AfriTeVa V2 does not outperform baselines for text classification, machine translation and summarization on Nigerian Pidgin (pcm). We note that AfriTeVa V2 was not pretrained on Nigerian Pidgin. As Nigerian Pidgin is an English-based creole, models pretrained on large amounts of English text are expected to perform well on the language. However, AfriTeVa V2 was pretrained on far less English text than the baselines we compare to, save for AfriTeVa. Still, we obtain results for Nigerian Pidgin that are competitive with the best baselines across the evaluation tasks.

Impact of Data Quality on LMs
Previous works have shown the correlation between the quality of the data used in pretraining a model and the performance of the trained model (Rae et al., 2021; Kreutzer et al., 2022; Hernandez et al., 2022). AfriTeVa V2's improvement over baselines in downstream tasks supports this. We note that AfriTeVa V2 outperforms the larger AfriMT5 and AfriByT5 (Alabi et al., 2022), which were trained on the unfiltered mC4 corpus. However, our pretraining dataset, WURA, contains ∼1.5× more data than mC4 contains across the 16 African languages. Thus, more experiments are needed to separate the effects of scale from those of data quality.

Table 4: AfriQA results (Ogundepo et al., 2023): For both metrics, AfriTeVa V2 outperforms mT5 except for twi.

AfriTeVa V2 Large Model
We also pretrain a large variant of AfriTeVa V2 using the same configuration as the T5-large model, except for the vocabulary size, which we set to 150,000, matching the configuration of AfriTeVa V2 (base) detailed in subsection 3.1. We present the effectiveness of scaling to a large model size on summarization and news topic classification tasks in Appendix C.

Related Work
The absence of a large monolingual corpus has long been the major challenge in leveraging the benefits of self-supervised pretraining to build representation and language models for African languages. The most widely available corpora are mostly religious, such as the Bible (Resnik et al., 1999) or JW300 (Agić and Vulić, 2019), alongside Wikipedia and the Common Crawl archive. The latter often has significant quality issues (Kreutzer et al., 2022).
Earlier work on building word representation models for African languages showed the benefit of training FastText embeddings on small, high-quality data (Alabi et al., 2020) over pretrained FastText embeddings developed from noisier Common Crawl data. Obtaining such high-quality data is tedious since it involves manually curating several verified sources. Thus, previous works have prioritized filtering of the Common Crawl data to produce better-quality datasets for pretraining (Conneau et al., 2020; Ortiz Suárez et al., 2019; Xue et al., 2021; Bapna et al., 2022). However, quality issues still persist in those filtered corpora. An alternative is to aggregate high-quality data for African languages, mostly from verified sources (Ogueji et al., 2021; Leong et al., 2022; Palen-Michel et al., 2022). However, this often results in smaller corpora.
The current models with impressive performance on African languages simply aggregate both low-quality and high-quality data for pretraining (Alabi et al., 2022; Adebara et al., 2022). The quality of these models implies that significant portions of the data must be of good quality. To this end, we systematically and rigorously filtered the low-quality data out of the mC4 corpus for African languages, similar to the OSCAR dataset approach. To the best of our knowledge, no previous work has done this. The OSCAR dataset has only a few documents for African languages, e.g., 37.2MB for Afrikaans, while our filtered corpus has more than 4.5GB.

Conclusion
In this work, we address the lack of large, quality pretraining datasets for African languages. While previous works have highlighted quality issues in existing pretraining datasets such as mC4, we demonstrate how these datasets can be enhanced by auditing their document sources and incorporating rigorous data filtering methods. To highlight the effectiveness of our approach and the relevance of this new dataset, we train a new T5 model, AfriTeVa V2, on our dataset. Our experiments show significant improvements across existing NLP benchmarks for African languages, underscoring the impact of quality pretraining data in training language models.

Limitations
The representativeness of our dataset poses a potential limitation. Despite our efforts to collect data from multiple African news websites, it is possible that our dataset does not fully capture the breadth and diversity of African news articles. The reliance on specific websites and the utilization of the mC4 dataset, along with existing corpora, may introduce inherent biases that our work does not address. Furthermore, our implementation of multi-level filtering techniques, including the removal of non-linguistic content in the target language, does not guarantee the complete removal of all text in different languages or other toxic content that may be present in the existing corpora.
Lastly, we acknowledge the need for future work to include more African languages.Our dataset only covers 16 languages, limiting the generalizability of our findings across the wide range of languages spoken in Africa.

A Data
A.1 mC4 Audit

We aim to tease out heuristics that are guaranteed to help us quickly and reliably extract high-quality monolingual text across the African languages in mC4. First, we reduce the source URL of each document to its hostname and keep a list of unique hostnames that exist for each language. For each language, we first sample a hostname, then sample 20 documents sourced from that hostname. This sampling strategy not only allows us to audit more documents and sources faster, it also allows us to trace existing quality issues to the source URLs that produced the documents. We follow the non-expert auditing strategies proposed by Kreutzer et al. (2022). Additionally, we visit the hostname URL to ascertain its purpose for speakers of the language and translate paragraphs in the document using Google Translate. We include the category under which each article was published; this information may be useful for identifying the domains in our dataset. We also release a list of the top document URLs for each language and invite native speakers to audit these sources to help us improve the quality of WURA.
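The hostname reduction and per-source sampling used in the audit can be sketched as follows (function names and the seeded sampling are ours, for reproducibility of the sketch):

```python
import random
from collections import defaultdict
from urllib.parse import urlparse

def hostname(url: str) -> str:
    """Reduce a document's source URL to its hostname."""
    return urlparse(url).hostname or ""

def sample_for_audit(docs_by_url: dict[str, str], k: int = 20,
                     seed: int = 0) -> list[str]:
    """Sample one hostname, then up to k of its documents, for manual audit."""
    rng = random.Random(seed)
    by_host = defaultdict(list)
    for url, doc in docs_by_url.items():
        by_host[hostname(url)].append(doc)
    host = rng.choice(sorted(by_host))
    return rng.sample(by_host[host], min(k, len(by_host[host])))
```

Sampling per hostname rather than per document keeps the audit from being dominated by a handful of very large sources.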

B Tokenization
In multilingual settings, the design of tokenizers has great impact on the downstream utility and inference cost of language models across languages (Petrov et al., 2023; Ahia et al., 2023). We characterize the performance of our tokenizers using fertility (Ács, 2019), defined as the number of subwords created per word (or per dataset) by the tokenizer. We compute fertility on the languages covered by MasakhanePOS (Dione et al., 2023). We train multiple unigram language models on our dataset using SentencePiece (Kudo and Richardson, 2018) with vocabulary sizes ranging from 100,000 to 250,000. As shown in Table 6, our dataset sizes vary over orders of magnitude between languages. To alleviate unfair treatment of the lowest-resourced of the languages we consider, we follow Lample and Conneau (2019) and learn the unigram language models on sentences sampled according to a multinomial distribution with probabilities {q_i}_{i=1...N} calculated as follows:

q_i = p_i^α / Σ_{j=1}^{N} p_j^α, where p_i = n_i / Σ_{k=1}^{N} n_k,

N denotes the number of languages and n_i the number of sentences in language i. We denote this as sampling configuration 1. We also investigate a sampling configuration 2 in which we further upsample languages which still do not have adequate representation.
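The multinomial sampling distribution of Lample and Conneau (2019) can be computed as follows. The exponent α is not restated in this section, so the value 0.3 below (their default) is purely illustrative:

```python
def multinomial_probs(counts: list[int], alpha: float = 0.3) -> list[float]:
    """q_i = p_i^alpha / sum_j p_j^alpha, with p_i = n_i / sum_k n_k.
    Exponentiating with alpha < 1 flattens the distribution, so
    low-resource languages are sampled more often than their raw share."""
    total = sum(counts)
    p = [n / total for n in counts]
    z = sum(pi ** alpha for pi in p)
    return [pi ** alpha / z for pi in p]
```

For example, a language holding 10% of the sentences receives well over 10% of the sampling probability under α = 0.3.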

C AfriTeVa V2 Large
We also pretrain a large variant of AfriTeVa V2 and present its effectiveness on summarization (Table 7) and classification (Table 8). For summarization, we finetune both models for 10 epochs and run inference using beam search with a width of 4. We gain improvements over the base model across both tasks, particularly for summarization, where ibo benefits the most.

Table 5: Tokenizer fertilities: We measure the fertilities of our tokenizers with varying vocabulary sizes using the MasakhanePOS dataset. The 150k tokenizer gives the best trade-off between size and fertility scores across all languages, especially in the second sampling configuration.