JTAV: Jointly Learning Social Media Content Representation by Fusing Textual, Acoustic, and Visual Features

Hongru Liang, Haozheng Wang, Jun Wang, Shaodi You, Zhe Sun, Jin-Mao Wei, Zhenglu Yang


Abstract
Learning social media content is the basis of many real-world applications, including information retrieval and recommendation systems. In contrast with previous works that focus mainly on single-modal or bi-modal learning, we propose to learn social media content by jointly fusing textual, acoustic, and visual information (JTAV). Effective strategies, namely attBiGRU and DCRNN, are proposed to extract fine-grained features of each modality. We also introduce cross-modal fusion and attentive pooling techniques to integrate multi-modal information comprehensively. Extensive experimental evaluation conducted on real-world datasets demonstrates that our proposed model outperforms the state-of-the-art approaches by a large margin.
Anthology ID:
C18-1108
Volume:
Proceedings of the 27th International Conference on Computational Linguistics
Month:
August
Year:
2018
Address:
Santa Fe, New Mexico, USA
Venue:
COLING
Publisher:
Association for Computational Linguistics
Pages:
1269–1280
URL:
https://aclanthology.org/C18-1108
Cite (ACL):
Hongru Liang, Haozheng Wang, Jun Wang, Shaodi You, Zhe Sun, Jin-Mao Wei, and Zhenglu Yang. 2018. JTAV: Jointly Learning Social Media Content Representation by Fusing Textual, Acoustic, and Visual Features. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1269–1280, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
Cite (Informal):
JTAV: Jointly Learning Social Media Content Representation by Fusing Textual, Acoustic, and Visual Features (Liang et al., COLING 2018)
PDF:
https://aclanthology.org/C18-1108.pdf
Code:
mengshor/JTAV