Code-Mixed text data consists of sentences having words or phrases from more than one language. Most multi-lingual communities worldwide communicate using multiple lan- guages, with English usually one of them. Hinglish is a Code-Mixed text composed of Hindi and English but written in Roman script. This paper aims to determine the factors in- fluencing the quality of Code-Mixed text data generated by the system. For the Hingli- shEval task, the proposed model uses multi- lingual BERT to find the similarity between synthetically generated and human-generated sentences to predict the quality of synthetically generated Hinglish sentences.
Offensive speech identification in countries like India poses several challenges due to the usage of code-mixed and romanized variants of multiple languages by the users in their posts on social media. The challenge of offensive language identification on social media for Dravidian languages is harder, considering the low resources available for the same. In this paper, we explored the zero-shot learning and few-shot learning paradigms based on multilingual language models for offensive speech detection in code-mixed and romanized variants of three Dravidian languages - Malayalam, Tamil, and Kannada. We propose a novel and flexible approach of selective translation and transliteration to reap better results from fine-tuning and ensembling multilingual transformer networks like XLMRoBERTa and mBERT. We implemented pretrained, fine-tuned, and ensembled versions of XLM-RoBERTa for offensive speech classification. Further, we experimented with interlanguage, inter-task, and multi-task transfer learning techniques to leverage the rich resources available for offensive speech identification in the English language and to enrich the models with knowledge transfer from related tasks. The proposed models yielded good results and are promising for effective offensive speech identification in low resource settings.
This paper presents the methodologies implemented while classifying Dravidian code-mixed comments according to their polarity. With datasets of code-mixed Tamil and Malayalam available, three methods are proposed - a sub-word level model, a word embedding based model and a machine learning based architecture. The sub-word and word embedding based models utilized Long Short Term Memory (LSTM) network along with language-specific preprocessing while the machine learning model used term frequency–inverse document frequency (TF-IDF) vectorization along with a Logistic Regression model. The sub-word level model was submitted to the the track ‘Sentiment Analysis for Dravidian Languages in Code-Mixed Text’ proposed by Forum of Information Retrieval Evaluation in 2020 (FIRE 2020). Although it received a rank of 5 and 12 for the Tamil and Malayalam tasks respectively in the FIRE 2020 track, this paper improves upon the results by a margin to attain final weighted F1-scores of 0.65 for the Tamil task and 0.68 for the Malayalam task. The former score is equivalent to that attained by the highest ranked team of the Tamil track.
This paper describes the team (“Tamalli”)’s submission to AmericasNLP2021 shared task on Open Machine Translation for low resource South American languages. Our goal was to evaluate different Machine Translation (MT) techniques, statistical and neural-based, under several configuration settings. We obtained the second-best results for the language pairs “Spanish-Bribri”, “Spanish-Asháninka”, and “Spanish-Rarámuri” in the category “Development set not used for training”. Our performed experiments will serve as a point of reference for researchers working on MT with low-resource languages.
This paper provides the description of shared tasks to the WAT 2021 by our team “NLPHut”. We have participated in the English→Hindi Multimodal translation task, English→Malayalam Multimodal translation task, and Indic Multi-lingual translation task. We have used the state-of-the-art Transformer model with language tags in different settings for the translation task and proposed a novel “region-specific” caption generation approach using a combination of image CNN and LSTM for the Hindi and Malayalam image captioning. Our submission tops in English→Malayalam Multimodal translation task (text-only translation, and Malayalam caption), and ranks second-best in English→Hindi Multimodal translation task (text-only translation, and Hindi caption). Our submissions have also performed well in the Indic Multilingual translation tasks.
Recent work on automatic sequential metaphor detection has involved recurrent neural networks initialized with different pre-trained word embeddings and which are sometimes combined with hand engineered features. To capture lexical and orthographic information automatically, in this paper we propose to add character based word representation. Also, to contrast the difference between literal and contextual meaning, we utilize a similarity network. We explore these components via two different architectures - a BiLSTM model and a Transformer Encoder model similar to BERT to perform metaphor identification. We participate in the Second Shared Task on Metaphor Detection on both the VUA and TOFEL datasets with the above models. The experimental results demonstrate the effectiveness of our method as it outperforms all the systems which participated in the previous shared task.