Anmol Goel


2022

SyMCoM - Syntactic Measure of Code Mixing A Study Of English-Hindi Code-Mixing
Prashant Kodali | Anmol Goel | Monojit Choudhury | Manish Shrivastava | Ponnurangam Kumaraguru
Findings of the Association for Computational Linguistics: ACL 2022

Code mixing is the linguistic phenomenon where bilingual speakers tend to switch between two or more languages in conversations. Recent work on code-mixing in computational settings has leveraged code-mixed social media texts to train NLP models. To capture the variety of code mixing within and across corpora, measures based on Language ID (LID) tags (CMI) have been proposed. Syntactic variety and patterns of code-mixing, and their relationship to computational models' performance, are underexplored. In this work, we investigate a collection of English (en)-Hindi (hi) code-mixed datasets through a syntactic lens and propose SyMCoM, an indicator of syntactic variety in code-mixed text with intuitive theoretical bounds. We train a SoTA en-hi PoS tagger, with an accuracy of 93.4%, to reliably compute PoS tags on a corpus, and demonstrate the utility of SyMCoM by applying it to various syntactic categories across a collection of datasets and comparing datasets using the measure.
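As a rough illustration of the LID-tag-based measures the abstract refers to, the sketch below computes a Code-Mixing Index for a single utterance in the commonly cited form 100 * (1 - max_lang_tokens / (n - u)), where n is the token count and u the number of language-neutral tokens. The function name `cmi`, the tag labels, and the choice of neutral tags are assumptions made for this example; they are not taken from the paper, and this is not SyMCoM itself.

```python
from collections import Counter

def cmi(lid_tags, neutral_tags=("univ", "other")):
    """Code-Mixing Index of one utterance from its per-token LID tags.

    neutral_tags are hypothetical labels for language-independent tokens
    (punctuation, named entities, etc.); adjust them to the tag set in use.
    """
    n = len(lid_tags)
    # Count only tokens that carry a language label.
    lang_counts = Counter(t for t in lid_tags if t not in neutral_tags)
    u = n - sum(lang_counts.values())  # number of language-neutral tokens
    if n == u:
        # No language-tagged tokens at all: treat as not mixed.
        return 0.0
    return 100.0 * (1 - max(lang_counts.values()) / (n - u))
```

A monolingual utterance scores 0, and the score grows as the tokens are spread more evenly across languages.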

HLDC: Hindi Legal Documents Corpus
Arnav Kapoor | Mudit Dhawan | Anmol Goel | Arjun T H | Akshala Bhatnagar | Vibhu Agrawal | Amul Agrawal | Arnab Bhattacharya | Ponnurangam Kumaraguru | Ashutosh Modi
Findings of the Association for Computational Linguistics: ACL 2022

Many populous countries, including India, are burdened with a considerable backlog of legal cases. Developing automated systems that can process legal documents and assist legal practitioners can mitigate this. However, there is a dearth of the high-quality corpora needed to develop such data-driven systems. The problem is even more pronounced for low-resource languages such as Hindi. In this resource paper, we introduce the Hindi Legal Documents Corpus (HLDC), a corpus of more than 900K legal documents in Hindi. Documents are cleaned and structured to enable the development of downstream applications. Further, as a use case for the corpus, we introduce the task of bail prediction. We experiment with a battery of models and propose a Multi-Task Learning (MTL) based model for this task. MTL models use summarization as an auxiliary task alongside bail prediction as the main task. Experiments with different models indicate the need for further research in this area.

2021

CoMeT: Towards Code-Mixed Translation Using Parallel Monolingual Sentences
Devansh Gautam | Prashant Kodali | Kshitij Gupta | Anmol Goel | Manish Shrivastava | Ponnurangam Kumaraguru
Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching

Code-mixed languages are very popular in multilingual societies around the world, yet the resources needed to build robust systems for such languages lag behind. A major contributing factor is the informal nature of these languages, which makes it difficult to collect code-mixed data. In this paper, we present our system for Task 1 of CALCS 2021, a machine translation system for English to Hinglish in a supervised setting. Translating in this direction can help expand the set of resources for several tasks by translating valuable datasets from high-resource languages. We propose to use mBART, a pre-trained multilingual sequence-to-sequence model, and to fully utilize the model's pre-training by transliterating the romanized Hindi words in the code-mixed sentences to Devanagari script. We evaluate how expanding the input by concatenating Hindi translations of the English sentences improves mBART's performance. Our system achieves a BLEU score of 12.22 on the test set. Further, we perform a detailed error analysis of our proposed systems and explore the limitations of the provided dataset and metrics.