How do different factors Impact the Inter-language Similarity? A Case Study on Indian languages

Sourav Kumar, Salil Aggarwal, Dipti Misra Sharma, Radhika Mamidi


Abstract
India is one of the most linguistically diverse nations in the world and is culturally very rich. Most of its languages are somewhat similar to each other, either because they share a common ancestry or because they have been in contact for a long period of time. Researchers are increasingly exploiting language relatedness to improve the performance of NLP systems such as cross-lingual semantic search, machine translation, and sentiment analysis. In this paper, we perform an extensive case study on similarity among languages of the Indian subcontinent. Language similarity prediction is defined as the task of measuring how similar two languages are on the basis of their lexical, morphological, and syntactic features. In this study, we concentrate only on lexical similarity between Indian languages and examine how it is affected by factors such as the size and type of corpus, the similarity algorithm, and subword segmentation. The main takeaways from our work are: (i) the relative order of the language similarities largely remains the same regardless of the factors mentioned above, (ii) similarity within the same language family is higher, and (iii) languages share more lexical features at the subword level.
Anthology ID:
2021.acl-srw.12
Volume:
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop
Month:
August
Year:
2021
Address:
Online
Editors:
Jad Kabbara, Haitao Lin, Amandalynne Paullada, Jannis Vamvas
Venues:
ACL | IJCNLP
Publisher:
Association for Computational Linguistics
Pages:
112–118
URL:
https://aclanthology.org/2021.acl-srw.12
DOI:
10.18653/v1/2021.acl-srw.12
Cite (ACL):
Sourav Kumar, Salil Aggarwal, Dipti Misra Sharma, and Radhika Mamidi. 2021. How do different factors Impact the Inter-language Similarity? A Case Study on Indian languages. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop, pages 112–118, Online. Association for Computational Linguistics.
Cite (Informal):
How do different factors Impact the Inter-language Similarity? A Case Study on Indian languages (Kumar et al., ACL-IJCNLP 2021)
PDF:
https://aclanthology.org/2021.acl-srw.12.pdf
Video:
 https://aclanthology.org/2021.acl-srw.12.mp4