Lasitha Randunu Chandrakantha Uyangodage
2025
Overview of the First Workshop on Language Models for Low-Resource Languages (LoResLM 2025)
Hansi Hettiarachchi | Tharindu Ranasinghe | Paul Rayson | Ruslan Mitkov | Mohamed Gaber | Damith Premasiri | Fiona Anting Tan | Lasitha Randunu Chandrakantha Uyangodage
Proceedings of the First Workshop on Language Models for Low-Resource Languages
The first Workshop on Language Models for Low-Resource Languages (LoResLM 2025) was held in conjunction with the 31st International Conference on Computational Linguistics (COLING 2025) in Abu Dhabi, United Arab Emirates. This workshop mainly aimed to provide a forum for researchers to share and discuss their ongoing work on language models (LMs) focusing on low-resource languages, following the recent advancements in neural language models and their linguistic biases towards high-resource languages. LoResLM 2025 attracted notable interest from the natural language processing (NLP) community, resulting in 35 accepted papers from 52 submissions. These contributions cover a broad range of low-resource languages from eight language families and 13 diverse research areas, paving the way for future possibilities and promoting linguistic inclusivity in NLP.
2024
NSina: A News Corpus for Sinhala
Hansi Hettiarachchi | Damith Premasiri | Lasitha Randunu Chandrakantha Uyangodage | Tharindu Ranasinghe
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
The introduction of large language models (LLMs) has advanced natural language processing (NLP), but their effectiveness is largely dependent on pre-training resources. This is especially evident in low-resource languages, such as Sinhala, which face two primary challenges: the lack of substantial training data and limited benchmarking datasets. In response, this study introduces NSina, a comprehensive news corpus of over 500,000 articles from popular Sinhala news websites, along with three NLP tasks: news media identification, news category prediction, and news headline generation. The release of NSina aims to provide a solution to challenges in adapting LLMs to Sinhala, offering valuable resources and benchmarks for improving NLP in the Sinhala language. NSina is the largest news corpus for Sinhala available to date.
Co-authors
- Hansi Hettiarachchi 2
- Damith Premasiri 2
- Tharindu Ranasinghe 2
- Mohamed Gaber 1
- Ruslan Mitkov 1