Tamil Lyrics Corpus: Analysis and Experiments

Dhivya Chinnappa, Praveenraj Dhandapani


Abstract
In this paper, we present a new Tamil lyrics corpus extracted from Tamil movies spanning 65 years (1954 to 2019). We present a detailed corpus analysis showing the nature of Tamil lyrics with respect to lyricists and the year in which they were written. We also present similarity scores across different lyricists based on their song lyrics. We present experimental results based on the SOTA BERT Tamil models to identify the lyricist of a song. Finally, we present future research directions encouraging researchers to pursue Tamil NLP research.
Anthology ID:
2021.dravidianlangtech-1.1
Volume:
Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages
Month:
April
Year:
2021
Address:
Kyiv
Venues:
DravidianLangTech | EACL
Publisher:
Association for Computational Linguistics
Pages:
1–9
URL:
https://aclanthology.org/2021.dravidianlangtech-1.1
Cite (ACL):
Dhivya Chinnappa and Praveenraj Dhandapani. 2021. Tamil Lyrics Corpus: Analysis and Experiments. In Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages, pages 1–9, Kyiv. Association for Computational Linguistics.
Cite (Informal):
Tamil Lyrics Corpus: Analysis and Experiments (Chinnappa & Dhandapani, DravidianLangTech 2021)
PDF:
https://aclanthology.org/2021.dravidianlangtech-1.1.pdf
Software:
 2021.dravidianlangtech-1.1.Software.zip
Dataset:
 2021.dravidianlangtech-1.1.Dataset.zip
Code:
 praveenraj0904/tamillyricscorpus