Konidioms Corpus: A Dataset of Idioms in Konkani Language

Naziya Mahamdul Shaikh, Jyoti D. Pawar, Mubarak Banu Sayed

Abstract

Konkani is a language spoken by a large number of people in the states on the west coast of India. It is the official language of the Indian state of Goa. Currently, no idiom corpus exists for Konkani, a low-resource language. This paper aims to advance idiomatic sentence identification, and thereby enhance linguistic processing, by creating the first corpus of idioms in the Konkani language. We select a unique list of 1597 idioms from multiple sources and proceed with a strictly controlled sentence creation procedure through crowdsourcing. This is followed by a quality check of the sentences and an annotation procedure carried out by experts in the Konkani language. We were able to build a good-quality corpus comprising 6520 sentences written in the Devanagari script of the Konkani language. Analysis of the collected idioms and their usage in the created sentences revealed the dominance of certain domains, such as ‘human body’, in the creation and occurrence of idiomatic expressions in the Konkani language. The corpus is publicly available.

Anthology ID: 2024.lrec-main.867
Volume: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month: May
Year: 2024
Address: Torino, Italia
Editors: Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues: LREC | COLING
Publisher: ELRA and ICCL
Pages: 9932–9940
URL: https://aclanthology.org/2024.lrec-main.867/
Cite (ACL): Naziya Mahamdul Shaikh, Jyoti D. Pawar, and Mubarak Banu Sayed. 2024. Konidioms Corpus: A Dataset of Idioms in Konkani Language. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 9932–9940, Torino, Italia. ELRA and ICCL.
Cite (Informal): Konidioms Corpus: A Dataset of Idioms in Konkani Language (Shaikh et al., LREC-COLING 2024)
PDF: https://aclanthology.org/2024.lrec-main.867.pdf
Optional supplementary material: 2024.lrec-main.867.OptionalSupplementaryMaterial.zip
