Visually Grounded Neural Syntax Acquisition

Haoyue Shi, Jiayuan Mao, Kevin Gimpel, Karen Livescu


Abstract
We present the Visually Grounded Neural Syntax Learner (VG-NSL), an approach for learning syntactic representations and structures without any explicit supervision. The model learns by looking at natural images and reading paired captions. VG-NSL generates constituency parse trees of texts, recursively composes representations for constituents, and matches them with images. We define the concreteness of constituents by their matching scores with images, and use it to guide the parsing of text. Experiments on the MSCOCO data set show that VG-NSL outperforms various unsupervised parsing approaches that do not use visual grounding, in terms of F1 scores against gold parse trees. We find that VG-NSL is much more stable with respect to the choice of random initialization and the amount of training data. We also find that the concreteness acquired by VG-NSL correlates well with a similar measure defined by linguists. Finally, we apply VG-NSL to multiple languages in the Multi30K data set, showing that our model consistently outperforms prior unsupervised approaches.
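The abstract describes a loop: compose constituent embeddings bottom-up, score candidate constituents against the paired image, and let that matching score (concreteness) drive parsing. The snippet below is a minimal illustrative sketch of that loop, not the authors' implementation: it assumes precomputed word and image embeddings, uses an L2-normalized sum as the composition function, and substitutes plain cosine similarity for the learned matching score (in the paper, parsing decisions come from a scorer trained against the image-caption matching signal). All function names are hypothetical.

```python
# Minimal sketch (not the authors' code) of concreteness-guided,
# greedy bottom-up parsing as described in the abstract. The names
# and the cosine-similarity scorer are illustrative assumptions.
import numpy as np

def compose(a, b):
    # Compose two constituent embeddings into one; an L2-normalized
    # sum keeps every constituent on the unit sphere.
    c = a + b
    return c / np.linalg.norm(c)

def concreteness(embedding, image_embedding):
    # Stand-in for the learned image-matching score: cosine
    # similarity between a constituent and the paired image.
    return float(embedding @ image_embedding)

def greedy_parse(word_embeddings, image_embedding):
    # Repeatedly merge the adjacent pair whose composition matches
    # the image best, yielding a binary constituency tree over
    # token indices.
    embs = [e / np.linalg.norm(e) for e in word_embeddings]
    trees = list(range(len(embs)))
    while len(embs) > 1:
        scores = [concreteness(compose(embs[i], embs[i + 1]), image_embedding)
                  for i in range(len(embs) - 1)]
        i = int(np.argmax(scores))
        embs[i:i + 2] = [compose(embs[i], embs[i + 1])]
        trees[i:i + 2] = [(trees[i], trees[i + 1])]
    return trees[0]

# Example: five random "word" vectors and one "image" vector.
rng = np.random.default_rng(0)
words = [rng.normal(size=64) for _ in range(5)]
image = rng.normal(size=64)
image /= np.linalg.norm(image)
print(greedy_parse(words, image))  # e.g. ((0, 1), ((2, 3), 4))
```

Greedy merging of the best-scoring adjacent pair always produces a binary tree; in the paper, the role of `concreteness` is played by a learned scorer rather than this fixed similarity.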
Anthology ID:
P19-1180
Volume:
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
Month:
July
Year:
2019
Address:
Florence, Italy
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
1842–1861
URL:
https://aclanthology.org/P19-1180
DOI:
10.18653/v1/P19-1180
Cite (ACL):
Haoyue Shi, Jiayuan Mao, Kevin Gimpel, and Karen Livescu. 2019. Visually Grounded Neural Syntax Acquisition. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1842–1861, Florence, Italy. Association for Computational Linguistics.
Cite (Informal):
Visually Grounded Neural Syntax Acquisition (Shi et al., ACL 2019)
PDF:
https://aclanthology.org/P19-1180.pdf
Video:
https://vimeo.com/384515583
Data:
COCO, Penn Treebank