Building a Linguistic Resource : A Word Frequency List for Sinhala

Aloka Fernando, Gihan Dias


Abstract
A word frequency list is a list of the unique words in a language along with their frequency counts, generally sorted by frequency. Such a list is essential for many NLP tasks, including building language models, POS taggers, spelling checkers, and word separation guides, in addition to assisting language learners. Such lists are available for many languages, but a large-scale word list is still not available for Sinhala. We have developed a comprehensive list of words, together with their frequencies and parts of speech (POS), from a large textbase. Unlike many other such lists, ours includes a large number of low-frequency words (many of which are erroneous), which enables the analysis of such words, including the frequencies of errors. In addition to the main list, we have also prepared a list of linguistically verified words. The word frequency list and the verified word list are the largest word lists available for the Sinhala language.
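As an illustration of the definition in the abstract, a word frequency list can be sketched in a few lines of Python. This is a minimal sketch and not the authors' pipeline: the function name `build_frequency_list`, the whitespace tokenization, and the sample sentence are all assumptions made for the example. A real Sinhala textbase would need proper tokenization, normalization, and the POS information the paper describes.

```python
from collections import Counter


def build_frequency_list(text):
    """Return (word, count) pairs sorted by descending count.

    Assumption: words are separated by whitespace. Ties are broken by
    code-point order so that the output is deterministic.
    """
    tokens = text.split()
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))


# Hypothetical sample text, used only to exercise the function.
sample = "මම ගෙදර යමි මම පාසල් යමි මම"
freq_list = build_frequency_list(sample)

# Words seen only once; the abstract notes that many low-frequency
# entries in a large list turn out to be errors.
singletons = [word for word, count in freq_list if count == 1]

for word, count in freq_list:
    print(word, count)
```

Keeping the singleton entries rather than discarding them mirrors the abstract's point that low-frequency words are retained so that they, and the errors among them, can be analysed.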
Anthology ID:
2021.icon-main.74
Volume:
Proceedings of the 18th International Conference on Natural Language Processing (ICON)
Month:
December
Year:
2021
Address:
National Institute of Technology Silchar, Silchar, India
Editors:
Sivaji Bandyopadhyay, Sobha Lalitha Devi, Pushpak Bhattacharyya
Venue:
ICON
Publisher:
NLP Association of India (NLPAI)
Pages:
606–610
URL:
https://aclanthology.org/2021.icon-main.74
Cite (ACL):
Aloka Fernando and Gihan Dias. 2021. Building a Linguistic Resource : A Word Frequency List for Sinhala. In Proceedings of the 18th International Conference on Natural Language Processing (ICON), pages 606–610, National Institute of Technology Silchar, Silchar, India. NLP Association of India (NLPAI).
Cite (Informal):
Building a Linguistic Resource : A Word Frequency List for Sinhala (Fernando & Dias, ICON 2021)
PDF:
https://aclanthology.org/2021.icon-main.74.pdf