We conducted a detailed analysis on the quality of web-mined corpora for two low-resource languages (making three language pairs, English-Sinhala, English-Tamil and Sinhala-Tamil). We ranked each corpus according to a similarity measure and carried out an intrinsic and extrinsic evaluation on different portions of this ranked corpus. We show that there are significant quality differences between different portions of web-mined corpora and that the quality varies across languages and datasets. We also show that, for some web-mined datasets, Neural Machine Translation (NMT) models trained with their highest-ranked 25k portion can be on par with human-curated datasets.
This paper describes our submission to the WMT24 shared task for Low-Resource Languages of Spain in the Constrained task category. Due to the lack of deep learning-based data filtration methods for these languages, we propose a purely statistical-based, two-stage pipeline for data filtration. In the primary stage, we begin by removing spaces and punctuation from the source sentences (Spanish) and deduplicating them. We then filter out sentence pairs with inconsistent language predictions by the language identification model, followed by the removal of pairs with anomalous sentence length and word count ratios, using the development set statistics as the threshold. In the secondary stage, for corpora of significant size, we employ a Jensen Shannon divergence-based method to curate training data of the desired size. Our filtered data allowed us to complete a two-step training process in under 3 hours, with GPU power consumption kept below 1 kWh, making our system both economical and eco-friendly. The source code, training data, and best models are available on the project’s GitHub page.
We analysed a sample of NLP research papers archived in ACL Anthology as an attempt to quantify the degree of openness and the benefit of such an open culture in the NLP community. We observe that papers published in different NLP venues show different patterns related to artefact reuse. We also note that more than 30% of the papers we analysed do not release their artefacts publicly. Further, we observe a wide language-wise disparity in publicly available NLP-related artefacts.
A word frequency list is a list of unique words in a language along with their frequency count. It is generally sorted by frequency. Such a list is essential for many NLP tasks, including building language models, POS taggers, spelling checkers, word separation guides, etc., in addition to assisting language learners. Such lists are available for many languages, but a large-scale word list is still not available for Sinhala. We have developed a comprehensive list of words, together with their frequency and part-of-speech (POS), from a large textbase. Unlike many other such lists, our list includes a large number of low-frequency words (many of which are erroneous), which enables the analysis of such words, including the frequencies of errors. In addition to the main list, we have also prepared a list of linguistically verified words. The word frequency list and the verified word list are the largest collections of words lists that are available for the Sinhala language.