%0 Conference Proceedings
%T JTAV: Jointly Learning Social Media Content Representation by Fusing Textual, Acoustic, and Visual Features
%A Liang, Hongru
%A Wang, Haozheng
%A Wang, Jun
%A You, Shaodi
%A Sun, Zhe
%A Wei, Jin-Mao
%A Yang, Zhenglu
%Y Bender, Emily M.
%Y Derczynski, Leon
%Y Isabelle, Pierre
%S Proceedings of the 27th International Conference on Computational Linguistics
%D 2018
%8 August
%I Association for Computational Linguistics
%C Santa Fe, New Mexico, USA
%F liang-etal-2018-jtav
%X Learning social media content is the basis of many real-world applications, including information retrieval and recommendation systems, among others. In contrast with previous works that focus mainly on single-modal or bi-modal learning, we propose to learn social media content by jointly fusing textual, acoustic, and visual information (JTAV). Effective strategies, namely attBiGRU and DCRNN, are proposed to extract fine-grained features of each modality. We also introduce cross-modal fusion and attentive pooling techniques to integrate multi-modal information comprehensively. Extensive experimental evaluation conducted on real-world datasets demonstrates that our proposed model outperforms state-of-the-art approaches by a large margin.
%U https://aclanthology.org/C18-1108
%P 1269-1280