TeluguNER: Leveraging Multi-Domain Named Entity Recognition with Deep Transformers

Suma Reddy Duggenpudi, Subba Reddy Oota, Mounika Marreddy, Radhika Mamidi


Abstract
Named Entity Recognition (NER) is a well-researched problem with strong results in English, owing to the availability of resources. Transformer models, specifically masked-language models (MLMs), have recently shown remarkable performance in NER. With growing data on online platforms, there is a need for NER in other languages too. NER remains underexplored in Indian languages due to the lack of resources and tools. Our contributions in this paper include: (i) two annotated NER datasets for the Telugu language in multiple domains, the Newswire Dataset (ND) and the Medical Dataset (MD), which we combine to form the Combined Dataset (CD); (ii) a comparison of the fine-tuned Telugu pretrained transformer models (BERT-Te, RoBERTa-Te, and ELECTRA-Te) with baseline models (CRF, LSTM-CRF, and BiLSTM-CRF); and (iii) a further investigation of the performance of the Telugu pretrained transformer models against the multilingual models mBERT, XLM-R, and IndicBERT. We find that the pretrained Telugu language models (BERT-Te and RoBERTa-Te) outperform the existing pretrained multilingual and baseline models in NER. On the large dataset (CD) of 38,363 sentences, BERT-Te achieves an F1-score of 0.80 (entity-level) and 0.75 (token-level). Further, these pretrained Telugu models show state-of-the-art performance on various existing Telugu NER datasets. We open-source our dataset, pretrained models, and code.
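The abstract reports F1 at two granularities, entity-level and token-level. The sketch below illustrates one common way the two can be computed and why they can differ. It is an assumption, not the authors' evaluation code: the abstract does not specify the tagging scheme or matching rules, so BIO tags, exact-span matching, the `PER`/`LOC` labels, and all function names here are hypothetical.

```python
def extract_entities(tags):
    """Return a set of (type, start, end) spans from a BIO tag sequence."""
    entities, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel "O" closes the last span
        inside = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not inside:
            entities.add((etype, start, i))  # end index is exclusive
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return entities


def f1(tp, n_pred, n_gold):
    """Harmonic mean of precision and recall; 0.0 when nothing matches."""
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


def entity_f1(gold, pred):
    """Assumed entity-level scoring: a span counts only if type and boundaries match exactly."""
    g, p = extract_entities(gold), extract_entities(pred)
    return f1(len(g & p), len(p), len(g))


def token_f1(gold, pred):
    """Assumed token-level scoring: each non-O token is scored on its own."""
    tp = sum(1 for g, p in zip(gold, pred) if g == p and g != "O")
    n_pred = sum(1 for p in pred if p != "O")
    n_gold = sum(1 for g in gold if g != "O")
    return f1(tp, n_pred, n_gold)


# Toy example: the prediction truncates the two-token PER entity to one token.
gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "O", "O", "B-LOC"]

print(round(entity_f1(gold, pred), 2))  # 0.5: the truncated PER span earns no credit
print(round(token_f1(gold, pred), 2))   # 0.8: the B-PER token still counts as correct
```

In this toy case a single boundary error costs a whole entity at the entity level but only one token at the token level, which is why the two scores for the same model need not agree.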
Anthology ID:
2022.acl-srw.20
Volume:
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop
Month:
May
Year:
2022
Address:
Dublin, Ireland
Editors:
Samuel Louvan, Andrea Madotto, Brielen Madureira
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
262–272
URL:
https://aclanthology.org/2022.acl-srw.20
DOI:
10.18653/v1/2022.acl-srw.20
Cite (ACL):
Suma Reddy Duggenpudi, Subba Reddy Oota, Mounika Marreddy, and Radhika Mamidi. 2022. TeluguNER: Leveraging Multi-Domain Named Entity Recognition with Deep Transformers. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pages 262–272, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal):
TeluguNER: Leveraging Multi-Domain Named Entity Recognition with Deep Transformers (Duggenpudi et al., ACL 2022)
PDF:
https://aclanthology.org/2022.acl-srw.20.pdf
Video:
https://aclanthology.org/2022.acl-srw.20.mp4
Data
WikiANN