2021
pdf
bib
abs
JUNLP@DravidianLangTech-EACL2021: Offensive Language Identification in Dravidian Langauges
Avishek Garain
|
Atanu Mandal
|
Sudip Kumar Naskar
Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages
Offensive language identification has been an active area of research in natural language processing. With the emergence of multiple social media platforms offensive language identification has emerged as a need of the hour. Traditional offensive language identification models fail to deliver acceptable results as social media contents are largely in multilingual and are code-mixed in nature. This paper tries to resolve this problem by using IndicBERT and BERT architectures, to facilitate identification of offensive languages for Kannada-English, Malayalam-English, and Tamil-English code-mixed language pairs extracted from social media. The presented approach when evaluated on the test corpus provided precision, recall, and F1 score for language pair Kannada-English as 0.62, 0.71, and 0.66, respectively, for language pair Malayalam-English as 0.77, 0.43, and 0.53, respectively, and for Tamil-English as 0.71, 0.74, and 0.72, respectively.
2020
pdf
bib
abs
JUNLP at SemEval-2020 Task 9: Sentiment Analysis of Hindi-English Code Mixed Data Using Grid Search Cross Validation
Avishek Garain
|
Sainik Mahata
|
Dipankar Das
Proceedings of the Fourteenth Workshop on Semantic Evaluation
Code-mixing is a phenomenon which arises mainly in multilingual societies. Multilingual people, who are well versed in their native languages and also English speakers, tend to code-mix using English-based phonetic typing and the insertion of anglicisms in their main language. This linguistic phenomenon poses a great challenge to conventional NLP domains such as Sentiment Analysis, Machine Translation, and Text Summarization, to name a few. In this work, we focus on working out a plausible solution to the domain of Code-Mixed Sentiment Analysis. This work was done as participation in the SemEval-2020 Sentimix Task, where we focused on the sentiment analysis of English-Hindi code-mixed sentences. our username for the submission was “sainik.mahata” and team name was “JUNLP”. We used feature extraction algorithms in conjunction with traditional machine learning algorithms such as SVR and Grid Search in an attempt to solve the task. Our approach garnered an f1-score of 66.2% when tested using metrics prepared by the organizers of the task.
pdf
bib
abs
Garain at SemEval-2020 Task 12: Sequence Based Deep Learning for Categorizing Offensive Language in Social Media
Avishek Garain
Proceedings of the Fourteenth Workshop on Semantic Evaluation
SemEval-2020 Task 12 was OffenseEval: Multilingual Offensive Language Identification inSocial Media (Zampieri et al., 2020). The task was subdivided into multiple languages anddatasets were provided for each one. The task was further divided into three sub-tasks: offensivelanguage identification, automatic categorization of offense types, and offense target identification.I participated in the task-C, that is, offense target identification. For preparing the proposed system,I made use of Deep Learning networks like LSTMs and frameworks like Keras which combine thebag of words model with automatically generated sequence based features and manually extractedfeatures from the given dataset. My system on training on 25% of the whole dataset achieves macro averaged f1 score of 47.763%.
2019
pdf
bib
abs
The Titans at SemEval-2019 Task 5: Detection of hate speech against immigrants and women in Twitter
Avishek Garain
|
Arpan Basu
Proceedings of the 13th International Workshop on Semantic Evaluation
This system paper is a description of the system submitted to ”SemEval-2019 Task 5” Task B for the English language, where we had to primarily detect hate speech and then detect aggressive behaviour and its target audience in Twitter. There were two specific target audiences, immigrants and women. The language of the tweets was English. We were required to first detect whether a tweet is containing hate speech. Thereafter we were required to find whether the tweet was showing aggressive behaviour, and then we had to find whether the targeted audience was an individual or a group of people.
pdf
bib
abs
The Titans at SemEval-2019 Task 6: Offensive Language Identification, Categorization and Target Identification
Avishek Garain
|
Arpan Basu
Proceedings of the 13th International Workshop on Semantic Evaluation
This system paper is a description of the system submitted to “SemEval-2019 Task 6”, where we had to detect offensive language in Twitter. There were two specific target audiences, immigrants and women. The language of the tweets was English. We were required to first detect whether a tweet contains offensive content, and then we had to find out whether the tweet was targeted against some individual, group or other entity. Finally we were required to classify the targeted audience.
pdf
bib
abs
JUMT at WMT2019 News Translation Task: A Hybrid Approach to Machine Translation for Lithuanian to English
Sainik Kumar Mahata
|
Avishek Garain
|
Adityar Rayala
|
Dipankar Das
|
Sivaji Bandyopadhyay
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)
In the current work, we present a description of the system submitted to WMT 2019 News Translation Shared task. The system was created to translate news text from Lithuanian to English. To accomplish the given task, our system used a Word Embedding based Neural Machine Translation model to post edit the outputs generated by a Statistical Machine Translation model. The current paper documents the architecture of our model, descriptions of the various modules and the results produced using the same. Our system garnered a BLEU score of 17.6.