Aggression Identification and Multi Lingual Word Embeddings

Thiago Galery, Efstathios Charitos, Ye Tian


Abstract
The system presented here was submitted by the Forest and Trees team to the 2018 Trolling, Aggression and Cyberbullying shared task. It uses a Gated Recurrent Neural Network architecture (Cho et al., 2014) to assess whether combining pre-trained English and Hindi fastText (Mikolov et al., 2018) word embeddings as a representation of the sequence input would improve classification performance. The motivation is that the shared task data for English contained many Hindi tokens, which suggests that some users might be code-switching: alternating between two or more languages in communication. To test this hypothesis, we aligned Hindi and English vectors using pre-computed SVD matrices that pull representations from different languages into a common space (Smith et al., 2017). Two conditions were tested: (i) one with standard pre-trained fastText word embeddings, where each Hindi word is treated as an OOV token, and (ii) another where word embeddings for Hindi and English are loaded into a common vector space, so that Hindi tokens can be assigned a meaningful representation. We submitted the second (i.e., multilingual) system and obtained scores of 0.531 weighted F1 for the EN-FB dataset and 0.438 weighted F1 for the EN-TW dataset.
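The two conditions can be illustrated with a minimal sketch, under several assumptions. The abstract says the alignment matrices were pre-computed, so the `learn_alignment` function below is only an illustration of how an SVD-based orthogonal map in the spirit of Smith et al. (2017) could be obtained from paired anchor vectors. The function names, the toy 4-dimensional vectors, the example token "bahut", and the choice of a zero vector for OOV tokens are all hypothetical and are not taken from the paper.

```python
import numpy as np

def learn_alignment(src, tgt):
    # Orthogonal Procrustes: the orthogonal W minimizing ||src @ W - tgt||_F
    # is U @ Vt, where U, S, Vt is the SVD of src.T @ tgt.
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt

def build_shared_space(en_vecs, hi_vecs, w_hi_to_en):
    # Map every Hindi vector into the English space; English entries win on clashes.
    shared = dict(en_vecs)
    for token, vec in hi_vecs.items():
        shared.setdefault(token, vec @ w_hi_to_en)
    return shared

def embed(tokens, table, dim):
    # Tokens missing from the table are OOV (assumed here to map to a zero vector).
    return np.stack([table.get(t, np.zeros(dim)) for t in tokens])

# Toy data: the "Hindi" anchors are a rotated copy of the "English" anchors.
rng = np.random.default_rng(0)
dim = 4
rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
hi_anchor = rng.normal(size=(20, dim))
en_anchor = hi_anchor @ rotation
w = learn_alignment(hi_anchor, en_anchor)

en_vecs = {"you": rng.normal(size=dim), "are": rng.normal(size=dim)}
hi_vecs = {"bahut": rng.normal(size=dim)}  # hypothetical Hindi token
tokens = ["you", "are", "bahut"]

mono = embed(tokens, en_vecs, dim)  # condition (i): Hindi token is OOV
multi = embed(tokens, build_shared_space(en_vecs, hi_vecs, w), dim)  # condition (ii)
```

In condition (i) the row for the Hindi token carries no information, whereas in condition (ii) it receives a vector that lives in the same space as the English ones. Either matrix of row vectors could then be fed to a recurrent classifier.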
Anthology ID: W18-4409
Volume: Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018)
Month: August
Year: 2018
Address: Santa Fe, New Mexico, USA
Editors: Ritesh Kumar, Atul Kr. Ojha, Marcos Zampieri, Shervin Malmasi
Venue: TRAC
Publisher: Association for Computational Linguistics
Pages: 74–79
URL: https://aclanthology.org/W18-4409
Cite (ACL): Thiago Galery, Efstathios Charitos, and Ye Tian. 2018. Aggression Identification and Multi Lingual Word Embeddings. In Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018), pages 74–79, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
Cite (Informal): Aggression Identification and Multi Lingual Word Embeddings (Galery et al., TRAC 2018)
PDF: https://aclanthology.org/W18-4409.pdf