%0 Conference Proceedings
%T JTAV: Jointly Learning Social Media Content Representation by Fusing Textual, Acoustic, and Visual Features
%A Liang, Hongru
%A Wang, Haozheng
%A Wang, Jun
%A You, Shaodi
%A Sun, Zhe
%A Wei, Jin-Mao
%A Yang, Zhenglu
%Y Bender, Emily M.
%Y Derczynski, Leon
%Y Isabelle, Pierre
%S Proceedings of the 27th International Conference on Computational Linguistics
%D 2018
%8 August
%I Association for Computational Linguistics
%C Santa Fe, New Mexico, USA
%F liang-etal-2018-jtav
%X Learning social media content is the basis of many real-world applications, including information retrieval and recommendation systems, among others. In contrast with previous works that focus mainly on single-modal or bi-modal learning, we propose to learn social media content by jointly fusing textual, acoustic, and visual information (JTAV). Effective strategies, namely attBiGRU and DCRNN, are proposed to extract fine-grained features of each modality. We also introduce cross-modal fusion and attentive pooling techniques to integrate multi-modal information comprehensively. Extensive experimental evaluation conducted on real-world datasets demonstrates that our proposed model outperforms state-of-the-art approaches by a large margin.
%U https://aclanthology.org/C18-1108
%P 1269-1280