Jiayuan Mao


pdf bib
Language-Mediated, Object-Centric Representation Learning
Ruocheng Wang | Jiayuan Mao | Samuel Gershman | Jiajun Wu
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

pdf bib
Proceedings of the Fourth Workshop on Visually Grounded Interaction and Language
Cătălina Cangea | Abhishek Das | Drew Hudson | Jacob Krantz | Stefan Lee | Jiayuan Mao | Florian Strub | Alane Suhr | Erik Wijmans
Proceedings of the Fourth Workshop on Visually Grounded Interaction and Language


pdf bib
Visually Grounded Neural Syntax Acquisition
Haoyue Shi | Jiayuan Mao | Kevin Gimpel | Karen Livescu
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

We present the Visually Grounded Neural Syntax Learner (VG-NSL), an approach for learning syntactic representations and structures without any explicit supervision. The model learns by looking at natural images and reading paired captions. VG-NSL generates constituency parse trees of texts, recursively composes representations for constituents, and matches them with images. We define concreteness of constituents by their matching scores with images, and use it to guide the parsing of text. Experiments on the MSCOCO data set show that VG-NSL outperforms various unsupervised parsing approaches that do not use visual grounding, in terms of F1 scores against gold parse trees. We find that VGNSL is much more stable with respect to the choice of random initialization and the amount of training data. We also find that the concreteness acquired by VG-NSL correlates well with a similar measure defined by linguists. Finally, we also apply VG-NSL to multiple languages in the Multi30K data set, showing that our model consistently outperforms prior unsupervised approaches.


pdf bib
Learning Visually-Grounded Semantics from Contrastive Adversarial Samples
Haoyue Shi | Jiayuan Mao | Tete Xiao | Yuning Jiang | Jian Sun
Proceedings of the 27th International Conference on Computational Linguistics

We study the problem of grounding distributional representations of texts on the visual domain, namely visual-semantic embeddings (VSE for short). Begin with an insightful adversarial attack on VSE embeddings, we show the limitation of current frameworks and image-text datasets (e.g., MS-COCO) both quantitatively and qualitatively. The large gap between the number of possible constitutions of real-world semantics and the size of parallel data, to a large extent, restricts the model to establish a strong link between textual semantics and visual concepts. We alleviate this problem by augmenting the MS-COCO image captioning datasets with textual contrastive adversarial samples. These samples are synthesized using language priors of human and the WordNet knowledge base, and enforce the model to ground learned embeddings to concrete concepts within the image. This simple but powerful technique brings a noticeable improvement over the baselines on a diverse set of downstream tasks, in addition to defending known-type adversarial attacks. Codes are available at https://github.com/ExplorerFreda/VSE-C.