Sentiment Analysis for Hinglish Code-mixed Tweets by means of Cross-lingual Word Embeddings

Pranaydeep Singh, Els Lefever


Abstract
This paper investigates the use of unsupervised cross-lingual embeddings for understanding code-mixed social media text. We specifically investigate these embeddings for sentiment analysis of Hinglish tweets, i.e., English combined with (transliterated) Hindi. In a first step, we trained baseline models initialized with monolingual embeddings obtained from large collections of tweets in English and code-mixed Hinglish. In a second step, we investigated two systems using cross-lingual embeddings: (1) a supervised classifier and (2) a transfer learning approach trained on English sentiment data and evaluated on code-mixed data. We demonstrate that incorporating cross-lingual embeddings improves the results (F1-score of 0.635 versus a monolingual baseline of 0.616), without requiring any parallel data to train the cross-lingual embeddings. In addition, the results show that the cross-lingual embeddings not only improve the results in a fully supervised setting but can also serve as a base for distant supervision, by training a sentiment model in one of the source languages and evaluating on the other language projected into the same space. The transfer learning experiments result in an F1-score of 0.556, which is almost on par with the supervised settings and speaks to the robustness of the cross-lingual embeddings approach.
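The transfer setting described in the abstract (train on English, evaluate on Hinglish projected into the same space) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the two-dimensional vectors in `SHARED_SPACE` are hypothetical toy values standing in for an already aligned cross-lingual embedding space, tweets are represented by averaging their word vectors, and a nearest-centroid rule stands in for whatever classifier the authors actually used. The alignment step itself is not shown.

```python
import math

# Hypothetical toy vectors standing in for an already aligned
# cross-lingual space (assumption: not the paper's embeddings).
SHARED_SPACE = {
    # English tokens
    "good": (1.0, 0.1),
    "love": (0.9, 0.2),
    "bad": (-1.0, 0.1),
    "hate": (-0.9, 0.2),
    "movie": (0.0, 1.0),
    # Transliterated Hindi tokens
    "accha": (0.95, 0.15),
    "mast": (0.85, 0.1),
    "bura": (-0.95, 0.15),
    "bakwas": (-0.8, 0.1),
    "tha": (0.0, 0.3),
}
DIM = 2


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def embed(tweet):
    """Average the vectors of known tokens; a zero vector if none are known."""
    vectors = [SHARED_SPACE[tok] for tok in tweet.lower().split() if tok in SHARED_SPACE]
    return mean_vector(vectors) if vectors else [0.0] * DIM


def fit_centroids(tweets, labels):
    """Nearest-centroid 'training': one mean tweet vector per sentiment label."""
    grouped = {}
    for tweet, label in zip(tweets, labels):
        grouped.setdefault(label, []).append(embed(tweet))
    return {label: mean_vector(vectors) for label, vectors in grouped.items()}


def predict(centroids, tweet):
    """Return the label whose centroid is closest to the tweet vector."""
    vector = embed(tweet)
    return min(centroids, key=lambda label: math.dist(vector, centroids[label]))


# Train on English sentiment data only.
english_train = ["good movie", "love movie", "bad movie", "hate movie"]
english_labels = ["positive", "positive", "negative", "negative"]
centroids = fit_centroids(english_train, english_labels)

# Evaluate on code-mixed tweets that share the same vector space.
hinglish_test = ["movie accha tha", "movie bakwas tha"]
predictions = [predict(centroids, tweet) for tweet in hinglish_test]
print(predictions)  # ['positive', 'negative']
```

The point of the sketch is only that no Hinglish labels are used at training time: the classifier sees English tweets, and the Hinglish tweets are classified correctly because their words sit near their English counterparts in the shared space.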
Anthology ID:
2020.calcs-1.6
Volume:
Proceedings of the 4th Workshop on Computational Approaches to Code Switching
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Thamar Solorio, Monojit Choudhury, Kalika Bali, Sunayana Sitaram, Amitava Das, Mona Diab
Venue:
CALCS
Publisher:
European Language Resources Association
Pages:
45–51
Language:
English
URL:
https://aclanthology.org/2020.calcs-1.6
Cite (ACL):
Pranaydeep Singh and Els Lefever. 2020. Sentiment Analysis for Hinglish Code-mixed Tweets by means of Cross-lingual Word Embeddings. In Proceedings of the 4th Workshop on Computational Approaches to Code Switching, pages 45–51, Marseille, France. European Language Resources Association.
Cite (Informal):
Sentiment Analysis for Hinglish Code-mixed Tweets by means of Cross-lingual Word Embeddings (Singh & Lefever, CALCS 2020)
PDF:
https://aclanthology.org/2020.calcs-1.6.pdf